Tip:
Highlight text to annotate it
X
Welcome to this training session about the use of some scalar functions for accessing
unicode data in your
DB2 for z/OS tables
Today most DB2 for z/OS installations store their data using the
default
single byte EBCDIC encoding
EBCDIC code pages are typically associated with a specific language
This means that a customer in Germany
would for example use code page eleven forty one
and a Greek customer would have to pick code page eight seventy five in
order to be able to store all Greek characters
While there are a number of characters which exist in both
code pages
this is by far not true for all
So what if a German company has customers whose names and addresses require these characters
which are specific to the Greek codepage
Clearly, the information cannot be stored correctly in the German DB2 subsystem
Instead the Greek characters would be replaced by some sort of substitution
characters
Because this is one
not right
and two
more and more non-compliant with in-country laws
DB2 for z/OS customers are considering moving away from EBCDIC encoding
and towards Unicode
The storage of the data in Unicode tables is not a problem at all
but when it comes to its usage
the application programmers and DBAs have to know and apply specific SQL
techniques
one area that needs some special treatment are for example scalar
functions
substr and length
The editor that you see on the screen right now
is the so-called SPUFI interface
that can be used to interactively execute SQL statements on
DB2 for z/OS
Let me first of all show you the CREATE statement that I use to create the
table where I am going to store the unicode data for this tutorial
The table
TAB1
just has a single column
C1
which is defined as character four
for mixed data
using encoding scheme unicode
If I specify FOR MIXED DATA for a character column along with
CCSID unicode
this means that the data stored in this column uses the
UTF-8 transformation method to
determine the hexadecimal representation of each of the
characters that I insert into my table
UTF-8 transformation method
which is the default for unicode tables in DB2 for z/OS
means that each character stored in column C1 requires at least one
and up to 4 bytes for its hexadecimal representation
I am going to create table TAB1 now
TAB1 is now ready for our unicode data
Let us look at a few different characters and the resulting hexadecimal
values they are transformed to
when I insert those into column C1 of my table
The first characters that I insert into my table
are upper case characters A and B
With my select statement
I request the contents of column C1 in two ways
first the character itself
and second the hexadecimal value that is stored in column C1
on the screen we now see the result of our select statement
First of all we see the character A and B respresnted as the characters itself
which was requested by the column cell
Then second, we see the hexadecimal value which are stored in column C1
four one represents the A
four two represents the B
and then we have 2 blanks. Remember, we created the column C1 as character
four
The second set of characters that I am going to insert are the paragraph sign (§)
followed by a B
The paragraph character, if encoded in UTF-8 unicode
needs two bytes
specifically
it is
hex
C2A7
And the B is encoded as hex
four two
So let us run the INSERT now and have a look at the SELECT output
The result of my SELECT now shows the two rows that I inserted into my table
First we have value A and B, which again are respresented by
for one
for two
and then we have the 2 blanks which are
represented by a two zero each
Now the second row
contains the paragraph B
Like I told you before, the paragraph needs 2 bytes for it's hexadecemal
representation. So it ends up as C287
Now the B, again is represented by the hexadecemal value 42 and then there
is one byte left in this character column which shows a blank
Let us assume you need to SELECT the first character only from column
C1
The way to do this for data stored in EBCDIC encoding
is to use the SUBSTR function as shown in my SELECT statement
I execute this statement now
Looking at the results, I see that for the first row in my table
the one that contains character AB in column C1
DB2 only returned the requested first character
that is an A
For the second row that contains paragraph (§)B
the query did not produce the desired result
Instead you see SQLCODE 331, which indicates a conversion error
The reason for this is that the SUBSTR
function does not operate based on characters, but is byte oriented
So if you ask for SUBSTR
of hex
C2A7
comma one comma one
DB2 only returns
C2
which is not a valid hexadecimal value for UTF-8 encoding
This means that if your data is encoded in unicode UTF-8
you must use a similarly sounding, but differently working scalar function
which is SUBSTRING
The SUBSTRING function requires you to tell DB2 how to look at the unicode
data
If you want to look at UTF-8 data character-wise, not byte-wise
you must specify SUBSTRING
C1 comma 1 comma 1 comma
CODEUNITS16
as you can see from the result, DB2 now retrieves the right
information from column C1, which is the A for the first row and
the paragraph sign for the second row
The exact same thing is true for the LENGTH function
LENGTH works byte oriented also
So instead of using Length when working with unicode data
you should look at the
CHARACTER_LENGTH
function, which was introduced
in DB2 version 8
If you find this information helpful
but you need some more basics
and/or deeper understanding on how to work with unicode data in
DB2 for z/OS
you might be interested in attending course CV28
DB2 for z/OS
How to deal with Unicode and Multiple CCSID data
This training can be delivered as classroom course or instructor led online
If you want to learn more about IBM training, please refer to the link
shown at the bottom of this page