Db2 for Z/os Skill Builder - How to deal with scalar functions for data encoded in unicode

Welcome to this training session about the use of some scalar functions for accessing unicode data in your DB2 for z/OS tables Today most DB2 for z/OS installations store their data using the default single byte EBCDIC encoding EBCDIC code pages are typically associated with a specific language This means that a customer in Germany would for example use code page eleven forty one and a Greek customer would have to pick code page eight seventy five in order to be able to store all Greek characters While there are a number of characters which exist in both code pages this is by far not true for all So what if a German company has customers whose names and addresses require these characters which are specific to the Greek codepage Clearly, the information cannot be stored correctly in the German DB2 subsystem Instead the Greek characters would be replaced by some sort of substitution characters Because this is one not right and two more and more non-compliant with in-country laws DB2 for z/OS customers are considering moving away from EBCDIC encoding and towards Unicode The storage of the data in Unicode tables is not a problem at all but when it comes to its usage the application programmers and DBAs have to know and apply specific SQL techniques one area that needs some special treatment are for example scalar functions substr and length The editor that you see on the screen right now is the so-called SPUFI interface that can be used to interactively execute SQL statements on DB2 for z/OS Let me first of all show you the CREATE statement that I use to create the table where I am going to store the unicode data for this tutorial The table TAB1 just has a single column C1 which is defined as character four for mixed data using encoding scheme unicode If I specify FOR MIXED DATA for a character column along with CCSID unicode this means that the data stored in this column uses the UTF-8 transformation method to determine the hexadecimal representation of each of the characters that I insert into my table UTF-8 transformation method which is the default for unicode tables in DB2 for z/OS means that each character stored in column C1 requires at least one and up to 4 bytes for its hexadecimal representation I am going to create table TAB1 now TAB1 is now ready for our unicode data Let us look at a few different characters and the resulting hexadecimal values they are transformed to when I insert those into column C1 of my table The first characters that I insert into my table are upper case characters A and B With my select statement I request the contents of column C1 in two ways first the character itself and second the hexadecimal value that is stored in column C1 on the screen we now see the result of our select statement First of all we see the character A and B respresnted as the characters itself which was requested by the column cell Then second, we see the hexadecimal value which are stored in column C1 four one represents the A four two represents the B and then we have 2 blanks. Remember, we created the column C1 as character four The second set of characters that I am going to insert are the paragraph sign (§) followed by a B The paragraph character, if encoded in UTF-8 unicode needs two bytes specifically it is hex C2A7 And the B is encoded as hex four two So let us run the INSERT now and have a look at the SELECT output The result of my SELECT now shows the two rows that I inserted into my table First we have value A and B, which again are respresented by for one for two and then we have the 2 blanks which are represented by a two zero each Now the second row contains the paragraph B Like I told you before, the paragraph needs 2 bytes for it's hexadecemal representation. So it ends up as C287 Now the B, again is represented by the hexadecemal value 42 and then there is one byte left in this character column which shows a blank Let us assume you need to SELECT the first character only from column C1 The way to do this for data stored in EBCDIC encoding is to use the SUBSTR function as shown in my SELECT statement I execute this statement now Looking at the results, I see that for the first row in my table the one that contains character AB in column C1 DB2 only returned the requested first character that is an A For the second row that contains paragraph (§)B the query did not produce the desired result Instead you see SQLCODE 331, which indicates a conversion error The reason for this is that the SUBSTR function does not operate based on characters, but is byte oriented So if you ask for SUBSTR of hex C2A7 comma one comma one DB2 only returns C2 which is not a valid hexadecimal value for UTF-8 encoding This means that if your data is encoded in unicode UTF-8 you must use a similarly sounding, but differently working scalar function which is SUBSTRING The SUBSTRING function requires you to tell DB2 how to look at the unicode data If you want to look at UTF-8 data character-wise, not byte-wise you must specify SUBSTRING C1 comma 1 comma 1 comma CODEUNITS16 as you can see from the result, DB2 now retrieves the right information from column C1, which is the A for the first row and the paragraph sign for the second row The exact same thing is true for the LENGTH function LENGTH works byte oriented also So instead of using Length when working with unicode data you should look at the CHARACTER_LENGTH function, which was introduced in DB2 version 8 If you find this information helpful but you need some more basics and/or deeper understanding on how to work with unicode data in DB2 for z/OS you might be interested in attending course CV28 DB2 for z/OS How to deal with Unicode and Multiple CCSID data This training can be delivered as classroom course or instructor led online If you want to learn more about IBM training, please refer to the link shown at the bottom of this page