Welcome everyone to the DB2 for z/OS best practice webinar. My name is Sheryl Larsen, and I'll be covering advanced SQL performance insights.
I came to IBM with a couple of decades of experience working on customers' SQL workloads, and I have done a lot of SQL teaching over the years. I am new to IBM.
I'm the new IBM worldwide DB2 for z/OS evangelist. I am very proud to be in this position and
happy to be talking to you today, giving you some of my performance insights and attempting to keep your SQL skills more up-to-date.
I'm still the current president of the Midwest DB2 User Group based in Chicago.
I spent a lot of time and a lot of hours with the early start of IDUG. Twenty-five years ago
I was on the board and the conference planning committee, spent a good decade volunteering, and became a Hall of Fame speaker.
I am also on the board of the Northern Illinois University Computer Science Alumni Council,
so I can help guide them in their curriculum and make sure they are keeping up, in IBM's eyes, with what is going on in technology.
On slide number 3 is the disclaimer and slide number 4 is our agenda.
I am going to start with the DB2 engine components,
and I will give you my stories, my analogies, for how I think about those components working together.
I am going to dive deep into the intelligence of the predicate processing. Those are all your WHERE clauses,
and that is really what it is all about: filter, filter, filter. Then I am going to give you optimizer details:
how are those filters handled, and at what point are those filters being applied?
Then I'm going to go back and review some performance rules and show you why, every release, I get to shrink those rules down.
In the early days I had 27 pages of performance rules, of flaming hoops the SQL had to jump through.
Now it is down to 2. And then I will try to get your skills a little updated by teaching you some new DB2 11
GROUP BY syntax, which you'll find very powerful and very efficient compared to what was available in prior releases.
On slide number 5 is a picture of the DB2 engine.
I am only showing the components; there are 10,000 modules that make up DB2 and over 10 million lines of code.
Here I am just depicting the components that operate on and execute your SQL. This runs from SQL digestion,
to getting it optimized, either dynamically or statically, to getting the access path stored, either directly in a package or in the dynamic statement cache.
And then into the actual execution of that access path. The buffer manager gets the instructions:
either head to an index or two, or not; follow the join procedures that have been mapped out. That is ultimately the place where
data is scooped off your hardware and actually brought into memory. So the buffer pool area that is kind of watery,
that is your memory. That is where the CPU work happens. And this has changed over the years. Stage 1 wasn't very smart in Version 1.
It could only handle equal to, greater than, and IS NULL. It can now handle approximately 43 conditions.
So what does that mean? It means that the filtering for those 43 conditions, 40 of them new, is moved closer to the data.
Closer to your index pages, closer to your data pages. And that makes queries go faster.
The stage 2 component just gets what is left over, so whatever isn't one of those 43 conditions gets filtered at that point.
Stage 2's job is also to deliver sequence. If it doesn't get the right sequence by going through an index,
and the result is large enough, it will throw it out to a work file and get that sort accomplished. As of DB2 11, more and more small
sorts are being handled right in memory, without going to the work file. So that is pretty much query execution.
When you turn to slide 6, this picture depicts an actual query going down into the engine,
picking up pages from either your indexes or the actual data rows and bringing them into the buffer pool.
Here you have to picture a little green elf called stage 1, running around on roller blades with a black marker in its hand.
What it has in that black marker are the filtering capabilities of those 43 WHERE-clause conditions.
So wherever that column lives, whether it is on an index page or a data page, wherever it is consumed,
at the soonest point it could possibly filter it pulls out its black marker and wipes out,
disqualifies, those values. The next thing that happens is stage 1, the little green elf, throws the marker to stage 2.
Stage 2 then has the marker erupt into an unlimited number of filters; it basically looks at whatever is left and
continues filtering. That is how those two components behave with each other: stage 1 is the initial cleanup, and stage 2
picks up everything else it possibly can.
If we go to slide 7, here is where the changes come in. When I first started in DB2, I could write those 3 predicates on my hand to
remember them; those were the best conditions to code. Now, as you see, we have two flavors of stage 1 conditions.
One flavor is indexable, which is special; there are 32 of them. And for plain stage 1, not indexable, there are 13.
That is where my 43 come into my black marker. And the great news is what is new for DB2 11: at the bottom of slide 7
you can see its growth. This is the most we've gotten in a single release of DB2 in terms of predicates.
Those 32 are special in that they can be used at the top part of an index, to probe through an index B-tree.
On the right-hand side, the 13 that are mentioned can't be used in the probe, but as soon as you get down to the leaf
page level, any of those conditions can be applied, in addition to the 32. That is where we get to the total of 43.
If we walk through the 4 points of filtering in the bottom part, you will see those are the areas depicted.
The first one: I can be used in the probe; that is the 32. Second point of filtering: I get all 43, but I am only on index pages.
Third point of filtering, I get all 43, but I am on data pages. Fourth point of filtering, I’m throwing the marker to stage 2
and it continues to filter any of those other conditions.
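To make the two flavors concrete, here is a sketch against a hypothetical CUSTOMER table (the table, columns, and values are all invented for illustration); the comments indicate where each predicate would typically be evaluated:

```sql
-- Hypothetical CUSTOMER table; names and values are illustrative only.
SELECT CUST_ID, LAST_NAME
  FROM CUSTOMER
 WHERE LAST_NAME = 'LARSEN'            -- indexable stage 1: can drive the probe
   AND CUST_ID BETWEEN 1000 AND 2000   -- indexable stage 1
   AND ROLE_CODE <> 'X'                -- stage 1, but not indexable: applied at
                                       --   the leaf/data pages, not in the probe
   AND UPPER(CITY) = 'CHICAGO';        -- stage 2: whatever is left over
```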
So let's take a closer look, on slide 8, at that actual expansion.
As some of you may have realized, my math is pretty atrocious; here is how you get 32 plus 13 is 43.
Some of the entries listed on this slide are pretty much synonyms. When you read "WHERE SUBSTR equals a value" and
"WHERE SUBSTR of a column operator value," those are actually considered one type of predicate.
So that is why the marker only has 43. These do crunch down to about 43. So that explains my bad math.
But what is interesting about the ones listed is that many of you have seen them. Maybe you haven't coded them yourself, because you were
taught well about when to use certain conditions and when not to;
but many of you may have seen them because you bought a package that contains this type of SQL. Well, all of these used to be stage 2.
They are now indexable stage 1. And I will tell you, the two most popular ones are the very top two.
Right, so my substringing of a column and my BETWEEN range are the top two. In third place comes the date function.
I wouldn't expect to see a single company that would not benefit from DB2 11 taking care of these somewhere,
because even if they've got smart programmers, some bad SQL can slip through the cracks and end up out there executing.
So with the rebind you are going to get, you are going to move up the food chain: when you think about it, you are going from the fourth
point of filtering to the very first point of filtering on all these conditions. Then there is the one on the bottom, the CASE expression equals
value. That is only one particular kind of CASE: wherever you have an expression in your CASE statement,
it would have to contain the DATE, YEAR, or SUBSTR scalar functions.
Those are the only ones that qualify. So not all CASE statements are going to be improved from the fourth point to
the first point, but certain specific CASE statements will: those with either no column references, and there are many of those,
or, if they have a column reference, ones containing those three scalar functions.
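As a sketch of those newly indexable forms (the EMP table, its columns, and the values are invented; :hv stands for a host variable), predicates like these were stage 2 on DB2 10 and become indexable stage 1 on DB2 11 after a rebind:

```sql
-- Hypothetical EMP table; stage 2 on DB2 10, indexable stage 1 on DB2 11.
SELECT EMP_ID
  FROM EMP
 WHERE SUBSTR(LAST_NAME, 1, 1) = 'L'           -- substring of a column
   AND YEAR(HIRE_DATE) = 2013                  -- YEAR scalar function
   AND DATE(LAST_CHANGE_TS) = '2013-06-30'     -- DATE scalar function
   AND :hv BETWEEN LOW_RANGE AND HIGH_RANGE    -- value BETWEEN col1 AND col2
   AND CASE WHEN SUBSTR(DEPT, 1, 1) = 'A'      -- the specific CASE ... = value
            THEN 'EAST' ELSE 'WEST'            --   form (expressions limited to
       END = 'EAST';                           --   DATE, YEAR, or SUBSTR)
```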
So let's go to slide 9 and see how this works.
I purposely put in a really poor example of filtering here, what I nicknamed "seriously delayed filtering."
If we take a closer look at the WHERE clause we are investigating, we've got a last-name filter, a role-code filter,
and a join condition, as well as a CASE statement. Most of those filters are on one table, the C table,
and we are going to assume that DB2 is going to do the join but go to the C table first.
Now suppose this is the only index structure available: (token number, role code). DB2 sees the opportunity of, first of
all, picking out values in sequence, which will help the join going into the next step,
and of filtering down by role code. But that is the only filter it can apply there, so it really skips the first point of filtering,
because it doesn't have a token number yet; it hasn't joined to the next table to get that. Then it starts doing index screening
on the role code. Then it has to bend down, pick up data pages, and continue filtering at the third point of filtering,
where it finally handles the last-name LIKE. At the fourth point of filtering, if on DB2 10, I hand my marker over to stage 2,
which continues filtering. So this is seriously delayed filtering. Fast forward to DB2 11, and let's make one other change:
let's move this filtering up the food chain by adding two columns to that index. Let's add the last name and sex columns
and make it a four-column index. Now I am still going to skip the first point, but I am going to do really powerful index screening
on all three of those conditions as I am cruising through the leaf pages. And DB2 11 will also be able to skip the fourth
point, because that particular filtering got moved up. So basically I'm going to be at point 2 and I'm done.
So that is how you make things go faster, and that is how DB2 11 is going to help things get faster. You have to maintain what goes into that
index, and you need to take advantage of the technology that can be applied at that index.
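The index change described above might look like this in DDL (the table, index, and column names here are hypothetical stand-ins for the slide's example):

```sql
-- Before: only the two-column index is available.
CREATE INDEX CIX1 ON C_TABLE (TOKEN_NBR, ROLE_CODE);

-- After: adding the two local filter columns enables index screening
-- on role code, last name, and sex while cruising the leaf pages.
CREATE INDEX CIX2 ON C_TABLE (TOKEN_NBR, ROLE_CODE, LAST_NAME, SEX);
```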
So let’s move to slide 10 and go over the performance rules.
Well, I always started this out in the past as a performance rule: convert your stage 2s to indexable stage 1s if you possibly can.
And the bad BETWEEN clause that I have been pointing out for, I'm going to say, a couple of decades has finally been taken care of.
The optimizer will see it and, underneath the covers, code the indexable conditions, and again that moves the filtering up
the food chain. So I have a big X through that one, because I don't have to do it anymore. I still need to pay attention to the use of NOTs.
You may have noticed that of the 13 stage 1 predicates, that is where most of the NOT conditions are.
And although they are pretty powerful, if you can make the condition positive and take the NOT out, that is a good performance rule,
provided that the number of known values, for instance in a NOT IN, is not too huge or too volatile.
The second and third rules are common sense: only the columns that you need, only the rows that you need.
Put all your intelligence into that query. Make it specific, not generic; don't get the whole row when you don't need
the whole row. You really need to think about pruning your select lists. And you definitely want to stick with literal values,
especially if you don't think a value is going to change in the next few years. This is definitely a rule for static queries,
where that information is powerful: if you hard-code it, the optimizer knows a lot more about it.
If you parameterize it using a host variable, it is making some guesses, which makes your access path a little more fragile.
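A small sketch of those rules, using an invented ORDERS table and invented values:

```sql
-- Rule: make it positive where you can.
--   WHERE STATUS NOT IN ('X', 'Y')   -->   WHERE STATUS IN ('A', 'B')
-- Rule: only the columns you need, and prefer literals in static SQL.
SELECT ORDER_ID, ORDER_TOTAL      -- no SELECT *; STATUS is not re-selected,
  FROM ORDERS                     --   since every qualifying row has 'A'
 WHERE STATUS = 'A'               -- literal: the optimizer knows far more
   AND REGION = 'MIDWEST';        --   about it than about a host variable
```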
If we go to slide 11, you'll see there are stage 2 conditions; remember, it is an
unlimited list, and this happens to be one of the most popular ones I've seen lately. It probably comes from,
I'm guessing, people who are coding on the distributed side. They definitely like that left-trim,
right-trim stuff, and they are always throwing in UPPER. Well, when you do things like this to a column in a WHERE clause, it is definitely
going to be stage 2. This is complex, and it won't use an index unless you have an index on expression for that particular function.
So the golden rule is to leave the columns alone in the WHERE clause. You want to keep them as vanilla as possible unless you are
providing that index on expression.
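If an application truly cannot leave the column vanilla, an index on expression is the escape hatch. A hedged sketch, with invented table, index, and column names:

```sql
-- Index on expression: makes the UPPER() predicate index-eligible.
CREATE INDEX IX_CUST_UNAME
    ON CUSTOMER (UPPER(LAST_NAME));

SELECT CUST_ID
  FROM CUSTOMER
 WHERE UPPER(LAST_NAME) = 'LARSEN';   -- can now match IX_CUST_UNAME
```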
<Slide 12, 14:55> Going to slide number 12, more performance rules. Try to keep your data types matching
perfectly. Even though as of DB2 Version 5 it tolerates mismatches, that doesn't mean they are free.
So if you can make it CHAR(4) compared to CHAR(4), keep it that way. Although DB2 can handle CHAR(4) compared to VARCHAR(26),
there is still some conversion when you traverse that index. That is why this is still a performance rule.
Another rule: do not select columns with known static values. If you put 5 equal-to conditions in your query,
there is no need to select those 5 columns, because you know that wherever the row qualifies, those conditions will be true.
You would basically be wasting effort scooping out values you already know about and handing them to your calling
program. These rules get pretty particular and obscure; again, I used to have 27 pages of these and we are down to just 2
pages. I also advise programmers that it is very rare that you will need to put a local filter in your ON clause.
There are exceptions where you will need during-join filtering, but for the most part it slows you down;
it is better to get before-join filtering by keeping it in your WHERE clause. The final one, and this is just being
really nit-picky, is that the order you type your conditions in does matter. It will not change your access path, but it will change your CPU.
If you can move the most powerful filter physically ahead of the less powerful filtering, you are giving yourself an early out.
For instance, say I have sex greater than 'F' and account greater than 99999. If I physically switch those around
so the account filter comes first, I'm doing a lot fewer tests on sex, because I'm hitting that early out first.
And that all happens within the leaf pages. So that is the nit-picky rule: sequence the conditions within each table from most restrictive
to least restrictive.
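The early-out idea, sketched with invented table and column names (the access path is identical either way; only the CPU spent on predicate tests changes):

```sql
SELECT EMP_ID
  FROM EMP
 WHERE ACCT_NBR > 99999   -- most restrictive filter physically first:
                          --   disqualifies most rows, giving the early out
   AND SEX > 'F';         -- weak filter, now evaluated far less often
```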
<Slide 13, 17:13> Going on to slide 13, some other things you need to keep in mind when you are using
host variables or parameter markers, as many of you do in dynamic queries. These charts haven't changed for years,
and you don't have to memorize them; they just give you an idea. Say you are using a BETWEEN filter, between a high value
and a low value, and the optimizer doesn't know the values. What it does is look up your cardinality. Let's say the table is 10
million rows: if you look across the chart, you'll see it assumes one 10,000th of your table is coming back from that filter.
That may or may not be anywhere near reality, but the optimizer has to assume something.
So just keep in mind when using LIKEs, BETWEENs, or range conditions that the optimizer is seriously looking for an index to
help; it is thinking the result set coming back is going to be quite small.
<Slide 14, 18:07> So we move to slide 14: the review checklist.
Best practice shops, this is the motion you have to go through. In fact, whenever I am given or sent some SQL that is not behaving,
or a program that is not behaving, the first thing I do is look at the program logic. What is it really doing with the table information?
What tables is it going to, how is it processing them, is it using relational technology correctly?
Things that don't pass that first test are things like SELECT INTO in a loop, and things like application joins.
That is not the correct use of the technology, and that is why you don't continue reviewing the SQL beyond that point;
you need to re-architect the program. So let's say you pass step one and you are using the technology intelligently.
Now it is a check of the FROM clause: do I have all the tables I need mentioned? Do I have all the join conditions mentioned?
For a five-table join, that is a minimum of four join conditions. Do I have any stage 2 conditions, or NOT conditions,
that I can undo and make indexable stage 1? Are there any columns in the select list I can omit, because
I have them in a WHERE clause, for example? Then I can prune that select list down so that, in case I incur a sort, the sort goes
faster because the rows are narrower. Throw in the local filtering sequence, most restrictive to least restrictive,
and now the query in that program is the best it can be. The last step, and a topic for another day, is to
examine what the optimizer chose for its access path. So that is basically the review list.
Now if we go to slide 15, what I am showing there is a little bit of the history of IBM's SQL
on the z/OS platform. My first SQL class was only about 3 hours long, because look what you had in that Version 1 container:
a tiny little language. I thought it was very cute when I was first taught it. Fast forward to a few years ago:
IBM really ramped it up as of DB2 9. We are not going to go over everything that is in there; just glance at the amount of
functionality that you see.
When you go to the next slide, slide 16, it is a comparison of DB2 9 for z/OS versus 9.5 on LUW.
What hangs at the top and bottom is exclusive to each release: what is exclusive to z is on top, and what is exclusive to DB2 LUW,
Linux, UNIX, and Windows, is on the bottom.
<Slide 17, 20:47> When you go to slide 17, that is DB2 10 versus DB2 LUW 10,
and you will see that the center section is growing. But so are the other sections, and that is because IBM's motto for
the SQL portfolio is: each of the development labs, and they are in two separate areas,
one in Silicon Valley and one in Toronto, has a different set of customers, and they will code to what their
customers really need. That is why you are going to continue to see the common section very strong, but there will also be growth
on both the top and the bottom.
<Slide 18, 21:23> So we go to slide 18, and that brings us up to date: DB2 11 versus DB2 10.5.
The items in red are the new ones, and the way the story goes, a lot of them go into that common area,
meaning they are synonyms. So you can develop on DB2 on a laptop, and as long as you stay within that common area,
it migrates very easily and perfectly to DB2 for z/OS. It is the other stuff, on the extreme parts, that makes each unique.
On slide 19 is what I do in most of the advanced classes I have taught over the past couple of decades: I have my students rank
themselves coming into the class. I have literally had some
people who knew nothing about SQL, because they were an IDMS shop or had some other non-relational database.
As the scale shows, zero to ten: they walked in thinking SQL was a new energy drink. I got that from a client; I have to
admit it wasn't my own humor. The scale goes from zero to ten, where ten says you walk on water in terms of SQL.
What I have done is list the features in the entire portfolio, the ones that matter for high-performance SQL, in the order
I usually teach them: the easiest stuff down to the more complicated stuff.
It is important to note where you are, and it is important to take care of your skill level as best you can.
IBM provides lots of information in its manuals; the DB2 11 manuals are up-to-date and already out there, and I
am going to cover a couple of those features, grouping sets, rollup, and cube, in just a couple of minutes.
<Slide 20, 23:13> If we go to slide number 20, you'll see that the DB2 accelerator,
a network-attached appliance for DB2, has also gotten some updates. It has a sweet spot of SQL that doesn't perfectly match
DB2's SQL. But DB2 11 has gotten so sophisticated that there are going to be a few fewer queries being offloaded, and a few
other ones getting on-boarded through, for instance, support for static SQL, which didn't appear until accelerator Version 4 and has been
rolled back to Version 3. So some queries stay on DB2, where they should be, and some roll into the accelerator.
It is wonderful technology, and that is where the SQL portfolio kind of divides, with the optimizer deciding
whether to go to the appliance or not.
<Slide 21, 24:13> On slide 21 there are even more analytic
capabilities. I am not going to go into the details, but just so you know, from an SQL perspective you can actually start using
modeling, SPSS models, within your queries, which puts the model within your transaction and lets you answer business questions
in real time. This is very, very unique technology, and any company can take advantage of it, for
example for fraud detection. Another thing companies are taking advantage of is some great technology using a couple of user-defined
functions, including a new one, and I love the way they describe it, polymorphic, that helps you read big data.
Big data has a lot of definitions; BigInsights is the tool IBM crafts,
a Hadoop-type infrastructure that lets you store big data very nicely. These two UDFs, as you see at the very bottom, can help you
read data from BigInsights into a result, a CSV file, and then you are allowed to join it to your
DB2 data. So here is where you can ask very interesting analytic questions, not only of your transactional data
but joined to your other data type, which is your big data. IBM is also moving to support JSON.
This is going to be very, very similar to XML support. In fact, at first blush it is going to store that JSON information,
much of which comes from mobile, right into DB2 as a LOB. And once it is there, attached to your row, you can ask
your interesting business questions with this extension to that information. So when you think about all that mobile information coming
in, you've now got the option of a safer, more reliable, more highly available place to store it, if you so choose.
So on slide 22, let's get to the new technology. The GROUP BY clause has barely been touched
since we got it in Version 1; it was GROUP BY column-name, and that was about it. I'll take that back: it did get an enhancement,
in Version 5 I believe, when you could finally GROUP BY an expression. But it is a one-element GROUP BY.
Grouping sets expand that into incredible technology, letting you code what, on DB2 for z/OS prior to DB2 11,
would have taken a lot of SQL. That is because what these grouping-set processes do is help you
integrate your detail information and layer it with your summary information. You can scoop out, in a
single result set: detail, detail, subtotal, detail, detail, subtotal, grand total. That is the type of technology you are
getting, and it is one pass. So not only is it wonderful syntax, it is actually going to perform much better than anything you could
have hand-coded in prior releases. Quite often this workload was offloaded to other platforms, where the
group by, grouping sets, and cubes were embedded either in an external tool or handled on a database that
could handle that particular technology. So let's start with the easy one, the grouping set.
In the example on the bottom we have SELECT WEEK of sales date as week, DAYOFWEEK of sales date, and salesperson.
So this is the resulting summary from the sales table, based on person and date. In the GROUP BY clause I'm saying GROUP BY GROUPING SETS,
and then you list your items one at a time: what is the first thing you want it broken down by, what is the second thing
you want it broken down by, and ultimately where do you want your totaling to be.
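A sketch of how that statement might look (the SALES table and column names follow the slide's example, but the exact slide text may differ slightly):

```sql
SELECT WEEK(SALES_DATE)      AS WK,
       DAYOFWEEK(SALES_DATE) AS DAY_WK,
       SALES_PERSON,
       SUM(SALES)            AS UNITS_SOLD
  FROM SALES
 GROUP BY GROUPING SETS ( SALES_PERSON,                              -- per person
                          (WEEK(SALES_DATE), DAYOFWEEK(SALES_DATE)), -- per day
                          WEEK(SALES_DATE) );                        -- per week
```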
So if we go to slide 23: the first bracket on the top right happens to be your
totals per salesperson. The middle bracket is your totals by day, and the third bracket is your totals by week.
And if this is your data, you've had a really lousy year if you only sold things in week 13 and week 53.
But you get my drift on how these total up. As a programmer, you have to know where those NULLs are going to pop up;
you have to be able to read them and understand the point where you have a subtotal or a final total.
<Slide 24, 29:10> Now let's look at GROUP BY ROLLUP on slide 24.
It is really a kind of macro for grouping sets, a quick and easy way to get multiple intersections.
For instance, GROUP BY ROLLUP (A, B, C) is equal to, if you love typing, GROUP BY A, B, C plus GROUP BY A, B plus GROUP BY A, plus the grand
total. So it is nice that we don't have to code all that. Here is the same example again, this time using the GROUP BY ROLLUP syntax;
that is on the bottom of slide 24. The results start on slide 25.
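With the same invented SALES names as before, the rollup form might be sketched as:

```sql
SELECT WEEK(SALES_DATE)      AS WK,
       DAYOFWEEK(SALES_DATE) AS DAY_WK,
       SALES_PERSON,
       SUM(SALES)            AS UNITS_SOLD
  FROM SALES
 GROUP BY ROLLUP ( WEEK(SALES_DATE),        -- A
                   DAYOFWEEK(SALES_DATE),   -- B
                   SALES_PERSON );          -- C
-- one pass, equivalent to grouping by (A,B,C) + (A,B) + (A) + grand total
```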
<Slide 25, 29:52> The whole result is on slide 25. Again I have the outermost subtotals,
and then they move into the next level, and the next level, and at the very bottom you
get NULL, NULL, NULL and the grand total of 153. So that's pretty nice.
<Slide 26, 30:16> Let's take it one step further: a lot of sites need and do a lot of cubing,
and that is basically slide 26. Think of moving that cubing workload to DB2 11, which can now do these rollups.
Now you'll see the cube. The cube is simply the rollup on steroids: it is (A, B, C) plus (A, B), plus (A, C), plus (B, C),
plus (A), plus (B), plus (C), plus the grand total. So this is every way you could look at those intersections.
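Again with the same invented SALES names, a cube sketch:

```sql
SELECT WEEK(SALES_DATE)      AS WK,
       DAYOFWEEK(SALES_DATE) AS DAY_WK,
       SALES_PERSON,
       SUM(SALES)            AS UNITS_SOLD
  FROM SALES
 GROUP BY CUBE ( WEEK(SALES_DATE),
                 DAYOFWEEK(SALES_DATE),
                 SALES_PERSON );
-- all 8 grouping combinations of the 3 dimensions, including the grand total
```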
If we turn to slide 27, you'll see how we depict that cube.
One of the dimensions is day, one is sales date, one is week, and at every intersection point I get a subtotal. And things roll up.
<Slide 28, 31:08> As we go to slide 28, you'll see the rollup ends at the very bottom, which is the grand total:
NULL, NULL, the person, and what they sold, and then the grand total of everything sold,
which is the bottom line: NULL, NULL, NULL, and the 153 again.
<Slide 29, 31:29> So, to wrap things up. Hopefully that was impressive technology for you,
and I'm hoping you can take advantage of it as soon as you possibly can. It would be very impressive, for example, for you to seek out
this kind of rollup and cubing work happening outside the DB2 for z/OS platform and consider moving it inside, in terms of GROUP BY, ROLLUP, and CUBE.
And best practice shops keep their SQL skills really close to up-to-date,
if not ahead, when new releases go live. That means continual education on these new types of features.
There are plenty of places to get that education: through IBM, through other vendors, manuals, and all kinds of material.
That is what best practice shops do. They also have a culture geared toward, "hey, I am responsible for this SQL
and I'm going to make it the best it can be." It has to be something the developer decides.
And just in case things fall through the cracks, they have some kind of review policy,
to make sure that if someone's skills weren't up-to-date we catch it before it goes into production.
And just in case we didn't catch it, let's monitor on a regular basis, so that nothing
goes in and causes real SQL and CPU spikes because it wasn't the best it could be.
The final thing that I'm seeing, and encouraging companies to do, is to move
the initial access path review into the developer's hands. Now, they are going to need some tools for this.
Of course IBM provides one, Data Studio, which is free, so I am happy to mention that; there are also for-purchase products.
What this process does is take the first line of defense off the plate of the DBA
and move it to the developer who is coding the SQL. The developer probably has the SQL on a screen,
probably using Rational or some other development tool, and can easily copy and paste it into a Data Studio screen
for that analysis. That will be a topic for another day; I and many others out here have already been walking you through how to
use that very wonderful free tool. So I hope this has been valuable to you.
I'll also mention another place to get SQL education. On joining IBM, I donated my own Advanced SQL training that I have used for
decades: the Advanced SQL class and also my tuning class. IDUG is graciously hosting those items; if you
search for me on IDUG's web site, it will take you to a place where you can download them. They are not PDF files;
they are the actual training materials I have been using for decades, and I left them updated to DB2 10.
There is a third file that covers DB2 11 at a really high level.
There are many more details coming in more updated training, but I encourage you to at least initially go there,
download it, and start getting your shop's skills up-to-date. And I wish you luck in your future development.
Thank you very much.