In the previous lecture, we completed our discussion of the different cache organizations, and we started looking at some small programming examples to get a better appreciation of how small changes in a program, or in an important loop of a program, might affect the cache behavior and hence the cache performance that the program benefits from when executed on a computer with cache memories. We were in the process of looking at a second example, which I called vector dot product. Now, all of the examples that we are looking at are relatively small in terms of the number of lines of C code, but they may be important in that they could be frequently occurring operations in larger programs, and therefore a proper understanding of such small but possibly important loops is beneficial to us. So, the second example that we are looking at is what I called the vector dot product, and on the slide we had looked at the C version of the vector dot product.
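The C version on the slide can be sketched as follows; a minimal version, assuming the sizes and types stated in the lecture (two vectors of 2048 double-precision elements).

```c
/* Vector dot product as described in the lecture: two vectors of
   2048 double-precision (8-byte) elements, multiplied term by term,
   with the products accumulated in a double-precision sum. */
#define N 2048

double a[N], b[N];

double dot_product(void)
{
    double sum = 0.0;               /* easily kept in a register by the compiler */
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];         /* two memory loads per iteration: a[i], b[i] */
    return sum;
}
```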
As the name suggests, it is an operation on vectors, which are one-dimensional arrays of data. In this particular case there are two vectors which are multiplied term by term, and the sum of the products is the result of the dot product. So, we see the for loop in C where the dot product of the vector a and the vector b is computed by multiplying a[i] by b[i] element by element and accumulating the sum in a double-precision variable. Note once again that both the vectors a and b are of size 2048, but, more important, each element of the vectors is of type double precision (double float), which means that the size of each element is eight bytes.

Now, the way that we were trying to understand this kind of example was, first of all, by ignoring all the instructions: though we know that instructions have to be fetched from memory, we assume that the instructions will benefit from a separate instruction cache and that we can concentrate our analysis on the data behavior. Further, we will ignore certain pieces of data, such as the loop index or a variable like sum, which could quite easily be kept in registers by the compiler. We are therefore left only with the references to the array elements of a and b which, since there are 2048 elements of each, cannot reasonably all be accessed out of registers.

So, the next step was to understand specifically what this loop amounts to in terms of load and store instructions; in other words, from the perspective of the cache and the memory system, what does this program mean? We analyzed it in terms of what I have referred to as a reference sequence.
This is the sequence of memory references that results when this loop executes. Looking at it at the level of the arrays a and b, we realize that in the first iteration through the for loop, first a[0] will have to be loaded from memory into a register, then b[0] will have to be loaded from memory into a register; subsequently they would be multiplied, but we are not concerned about the multiply or the add or the other instructions, since we are concentrating only on loads and stores. Therefore, the reference sequence would simply have load a[0], load b[0] in the first iteration; in the second iteration of the for loop, load a[1], load b[1]; and so on. So, all of the 2048 iterations would look along the lines of what we see here, each iteration involving a load of an element of a and a load of the corresponding element of the vector b. To proceed with our analysis, we have to make some assumptions about what the addresses of the elements of a and b are.
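Whatever base addresses the compiler picks, one thing C does guarantee is that each array is laid out contiguously, so the element addresses follow directly from the base. A quick check of this property (the helper function is purely illustrative):

```c
/* Contiguous array layout: with 8-byte doubles, a[i] lives exactly
   8*i bytes past a[0]; only the base address is up to the compiler. */
#include <stddef.h>

double a[2048], b[2048];

/* Byte distance of v[i] from the start of the array v. */
ptrdiff_t byte_offset_of(const double *v, int i)
{
    return (const char *)&v[i] - (const char *)&v[0];   /* equals 8 * i */
}
```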
The basic assumption that we made is that the compiler will assign an address to the first element of a and to the first element of b, and that the subsequent elements of a and b will be in contiguous locations; for example, a[1] will be in the location next to a[0]. Since each element of a is of size eight bytes, if a[0] is at the address hex A000, then a[1] will be at the address hex A008, which is eight bytes after, that is, contiguous in memory with, the location of a[0], and similarly for all the other elements. We likewise assume that the compiler assigns the vector b to some other address range, and this information will be important to us in doing the cache analysis.

You will remember that to understand what is happening in the cache, we have to keep track of the number of hits and misses when this program executes, that is, when this sequence of memory references is sent to the cache and the memory system. Therefore, we had to do the analysis based on how the cache hardware views an address. For the cache that we were considering in the examples that we did last time, we were assuming that there is a data cache of size sixteen kilobytes. The cache is direct mapped, it uses the write-back memory update policy, and the size of each cache block is thirty-two bytes. From this we could figure out exactly how the cache hardware views an address.
So, if I look at the thirty-two bits of an address, such as the base address of the vector a or the base address of the vector b, the least significant five bits are used as the offset to identify a particular byte within a block. The next nine bits are used as the index into the cache directory (remember, we are assuming a direct-mapped cache), and the remaining bits are used by the cache hardware to distinguish between hits and misses, in terms of which main memory block is currently present in a given cache block. This perspective, for the particular cache that we are talking about, the sixteen-kilobyte direct-mapped cache with blocks of size thirty-two bytes, is important for us in understanding the behavior of the specific program that we are talking about.
So, we went about doing the analysis by deriving a table in which we had the different references: load a[0], load b[0], load a[1], load b[1], in the order shown in the table. For each of the references there is an address; for example, the address of a[0] is hex A000, the address of b[0] we are assuming is hex E000, the address of a[1], as we just calculated, is hex A008, and so on. Now we put this together with our understanding of how the cache views addresses. Remember, these are thirty-two-bit addresses; the least significant five bits are the byte offset within a block, and the next nine bits are the index into the cache directory.

So, in understanding what is going to happen within the cache when the program corresponding to these memory references executes, we have to view each of the addresses in the light of this cache perspective on addresses. Consider A000 in hex: if I take A000 and expand it into binary, I end up with a thirty-two-bit sequence. Notice that in A000 each of the hex digits is a four-bit value, and I have given you only four hex digits, which means that this is really a sixteen-bit address. But I am expanding it into a thirty-two-bit address, since that is what the cache hardware is assuming, and I do this by prefixing sixteen zeros as the most significant bits. I do not want to change the value of the address; therefore, hex A000 is really hex 0000A000, which is why I expanded it in this way.
If I now look at this thirty-two-bit version of hex A000, the base address of the vector a, in terms of the least significant five bits and the next nine bits, I find that the index associated with A000 is the value 1 0000 0000 in binary, which you can compute to be equal to 256. Similarly, I can look at the base address of b[0], which is hex E000, which works out to the bit sequence I have expanded here: the E is 1110 in binary and the other bits are all zeros. But if I look at the index bits corresponding to this address, I notice that once again it is exactly the same as the index bits associated with a[0], which means that in my direct-mapped cache, the block containing vector element b[0] will occupy the same cache block as the block containing vector element a[0]. This is the worst-case situation that we talked about when we were worried about the problems with direct-mapped caches, and it has arisen in this example. If we continue to look through the sequence of references, what is the address of a[1]? The address of a[1], as we saw, is eight bytes more than the address of a[0], and that works out to the bit pattern which is third in the sequence here; it is eight added to the base address of the array, and you will notice it too has an index of 256 in decimal. This we knew, because a[1] is a neighbor of a[0]; it would likely be in the same cache block as a[0] and therefore would have the same cache index.
In fact, from our calculation last time, given that the size of each cache block is thirty-two bytes and the size of each array element is eight bytes, we expect that four consecutive array elements, whether of a or of b, will be in a single cache block. So, we expected that a[0], a[1], a[2], and a[3] could all be within the same cache block; in other words, they would all have the same index. As we proceed with expanding the addresses and looking at them in the light of the cache perspective on addresses, we do in fact see that this is the case: a[0], a[1], a[2], and a[3] all have the same index, and differ only in their offset bits within the block; they occupy four different eight-byte sections within the block of size thirty-two bytes.
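The address fields just described can be written down as a few helper functions; a sketch for this particular cache (16 KB direct mapped, 32-byte blocks, hence 512 sets):

```c
/* How a 16 KB direct-mapped cache with 32-byte blocks splits a
   32-bit address: 5 offset bits, 9 index bits, and the rest is the tag. */
#include <stdint.h>

static inline uint32_t block_offset(uint32_t addr) { return addr & 0x1F; }          /* bits 4..0   */
static inline uint32_t cache_index(uint32_t addr)  { return (addr >> 5) & 0x1FF; }  /* bits 13..5  */
static inline uint32_t cache_tag(uint32_t addr)    { return addr >> 14; }           /* bits 31..14 */
```

Applied to the lecture's addresses, 0xA000, 0xE000, and 0xA008 all map to index 256, differing only in tag or offset, which is exactly the conflict being described.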
But the problem is now very clear. If we assume the cache starts out empty, then the attempt to load a[0], the first reference made by this vector dot product program, will result in a cache miss, and in handling the cache miss the cache hardware will copy the block from main memory into the cache.

Unfortunately, the next reference made by this program is to b[0], which will also suffer a cache miss and will load the block containing b[0] from memory into the cache, so that when the time comes to reference a[1] next, the block which contains a[1] is no longer present in the cache; it has been replaced by the block containing b[0] through b[3]. The bottom line is that when we do the analysis for this unfortunate program on the cache under consideration, we find that every single reference is going to be a miss. Some of the misses are what we call conflict misses, because they are not a result of the program starting off with the cache empty; these conflict misses are a result of the program making references in such a fashion that the references conflict with each other: the reference to b[0] replaced the block which would have been used to satisfy the request for a[1], and so on.
So, the references of the program conflict with each other, resulting in misses, and this is not a problem with how the program started; it is a problem with the structure of the program, with the way the program is written. The bottom line, then, is that this program will get no benefit from the cache; it will end up with a hit ratio of zero percent. In other words, every single memory reference made by this program will result in a memory access of a hundred nanoseconds or more, and none of the references of this program will benefit from a one-cycle, one-nanosecond cache hit. The source of the problem, as we saw, was that the elements of the arrays a and b, which are accessed in the order a[0] followed by b[0] followed by a[1] followed by b[1], conflict with each other in the sense that they have the same cache index; in a direct-mapped cache they must occupy the same location in the cache, and therefore one after the other they replace each other, resulting in a zero percent hit ratio.
So, the solution which we talked about last time was to try to correct this problem by causing the base address of, let us say, vector b to be adjusted so that it no longer conflicts, in terms of the index value, with the base address of vector a. One way that I suggested this could be done is by using our knowledge of how compilers are likely to work: compilers are, after all, what assign the virtual addresses to the different variables and arrays of a program. If we assume that compilers do so by dealing with the declarations as they are encountered, using consecutive memory addresses for consecutive variables as they are declared, then we could cause the base address of b to change: for example, instead of using the declaration double a[2048], b[2048], we artificially increase the size of the vector a, as outlined over here, to 2052. In other words, I increase the size of the vector a by four elements, which means I increase the size of the vector a by thirty-two bytes, which is the size of one cache block. This means that the address of vector b will go up by thirty-two, and therefore its index will change by one; that is the net effect, if you go back to the address interpretation by the cache hardware. So, the base address of b would now be E020 rather than E000, and the cache index value for b[0] would now be 257 rather than 256. Therefore a[0] and b[0] would no longer compete for the same cache block: when a[0] is loaded there will be a cache miss for cold-start reasons; after that, when b[0] is loaded there will be a cache miss for cold-start reasons, and b[0] will occupy a different location in the cache. Subsequently, the next reference, loading a[1], will get a hit, because it is in the same block as a[0] was, and we get the benefit of spatial locality of reference. The bottom line is a hit ratio which is substantially more than the zero percent that we got before.

I also suggested an alternative way to set this up. Rather than artificially increasing the size of the array a, if I am assuming that the compiler assigns addresses in the order of encountering the declarations, then, even if I keep the declared size of vector a as 2048, I can include other variables between the declaration of a and the declaration of b, and achieve the same effect of causing the base address of b to be adequately different from the base address of a. For example, in this variant a is declared as a vector of size 2048, as it should be, and b is also declared as a vector of size 2048, but in between I declare four variables, dummy1, dummy2, dummy3, and dummy4, each of type double, which means that the total space occupied by these four variables is thirty-two bytes. If the compiler assigns consecutive, contiguous addresses, the base address of dummy1 would be hex E000, the base address of dummy2 would be E008, and so on.
So, the base address of b would end up being E020, which has a sufficiently different index value to not conflict with the vector a. Now, the four variables dummy1 through dummy4 are not variables that my program will use, and therefore, even though they would conflict with the vector a if they were used, the problem does not arise for this particular program. They are variables which are declared and included in the program merely for the purpose of causing the base address of vector b to be such that it does not conflict with the vector a. So, the bottom line in this case would be a substantially improved hit ratio; you can do the calculation to find out that it would be at least seventy-five percent, a great improvement over zero percent.
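These hit counts can be checked with a tiny model of the cache. The sketch below simulates only what the analysis needs (one tag per set, no timing, the lecture's 16 KB direct-mapped cache with 32-byte blocks) and replays the dot-product reference sequence for a given pair of base addresses:

```c
/* Minimal direct-mapped cache model: 16 KB, 32-byte blocks, 512 sets.
   Replays the dot-product reference sequence (load a[i], load b[i])
   and reports the hit ratio for given base addresses of a and b. */
#include <stdint.h>

#define NSETS 512
#define NELEM 2048

static uint32_t tag_of[NSETS];
static int      valid[NSETS];
static int      hits, refs;

static void cache_reset(void)
{
    for (int s = 0; s < NSETS; s++) valid[s] = 0;
    hits = refs = 0;
}

static void cache_access(uint32_t addr)
{
    uint32_t set = (addr >> 5) & 0x1FF;   /* 9 index bits  */
    uint32_t tag = addr >> 14;            /* remaining bits */
    refs++;
    if (valid[set] && tag_of[set] == tag) { hits++; return; }
    valid[set] = 1;                       /* miss: fill the block */
    tag_of[set] = tag;
}

double dot_hit_ratio(uint32_t a_base, uint32_t b_base)
{
    cache_reset();
    for (uint32_t i = 0; i < NELEM; i++) {
        cache_access(a_base + 8 * i);     /* load a[i] */
        cache_access(b_base + 8 * i);     /* load b[i] */
    }
    return (double)hits / refs;
}
```

With the original bases A000 and E000 the model gives a 0 percent hit ratio; with b moved to E020, as the padding achieves, it gives exactly 75 percent.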
Now, there are other ways that one could handle this particular problem with this loop, and another idea is what is sometimes known as array merging; let me explain what array merging amounts to. You will recall that in the previous version of the program I declared a as a vector of size 2048 and b as a vector of size 2048, but I noted that the way my program was using the elements of a and b was that after referring to element a[i] it next referred to element b[i]. In other words, there is a relationship between a[i] and b[i], which I could cause to be reflected in the declarations of a and b, for example by something like what we have on the screen here. Rather than declaring a and b as separate vectors, what I have done is declare a and b as being fields of a struct (structure), and I declare an array of those structure elements.
The size of that array is 2048, and within each element of the array there is one element of a and one element of b. For example, within element zero of the array I have what I used to refer to as a[0] and b[0], packaged together by this struct which I have declared. So now, when I need to write the vector operation, in each iteration through this loop I multiply the a part of the ith element of the array by the b part of the ith element of the array, and then I accumulate the product into the sum variable. It is a minor variant, in that the calculation which is computed is the same as it was before, but we have changed things in that, if you think about the addresses of a and b under this kind of declaration, every element of the array holds an element of a and then an element of b, and therefore each element of a and the corresponding element of b will be in consecutive memory locations.
Therefore, doing the kind of calculations we have talked about up to now, we would find that the element a[0] and the element b[0] are next to each other in memory and therefore in the same cache block; so, while the access to the a part of array[i] might be a cache miss, the access to the b part of array[i] would be a cache hit, and similarly, when I go to the next iteration, both the a and b parts would be hits. This is why we come up with an estimate that there would be a 75 percent hit ratio. So, this is an alternative perspective on how one could improve on that zero percent hit ratio for the selfsame operation of computing the dot product of two vectors.
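The merged declaration and loop can be sketched as follows; the struct and field names are illustrative, since the slide's exact identifiers are not given:

```c
/* Array merging: each struct packs a[i] and b[i] into one 16-byte
   unit, so a pair always sits in consecutive memory locations and
   (with 32-byte blocks) two pairs share one cache block. */
#define N 2048

struct pair { double a, b; };   /* field names are illustrative */
struct pair ab[N];

double dot_product_merged(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += ab[i].a * ab[i].b;   /* a and b parts are 8 bytes apart */
    return sum;
}
```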
Now, in some situations it may not be a good idea to change a declaration like this, because, as you are aware, when you write programs you often want the program to be readable, and the structure of the program should reflect the natural phenomenon, or the physics problem, or the actual objective for which the program was written; by artificially packaging elements of a along with elements of b, it may become a little bit harder for the person reading or using the program to understand it. But one should bear in mind that the benefits of doing this may make it worthwhile to reassess the readability of the program. And remember, this is an alternative to the previous technique, in which I suggested that one carefully cause the base address of the conflicting vector b to be adjusted, either by adding dummy variables in between or by artificially declaring one of the vectors a little bit bigger than it actually needs to be.

Moving right along, let us look at a new example. The next example is something I will refer to as d-a-x-p-y, which I will pronounce as daxpy. Until now, the examples that we used had expanded names from which we could get a rough understanding of what they did, like vector dot product; here, for the first time, that is clearly not going to be the case. But obviously this is an important operation, so important that it has this abbreviated name.
Now, what the daxpy program that we are going to talk about does is a double-precision operation: it multiplies a vector x by a scalar value a (that is the "a x") and then adds the corresponding element of a vector y to each product. Double-precision a times x plus y is what daxpy stands for. One could think of another program which I might call saxpy and, as you would imagine, this must refer to a similar operation, except that rather than operating on double-precision values it operates on single-precision, or thirty-two-bit, floating point values.

So, clearly these must be very important operations, and we will refer to daxpy later, but the nature of the operation is quite clear. Once again we have two vectors, a vector x and a vector y. Each time through the loop, as we are going to see, an element of x is multiplied by the scalar a (so a is not a vector; a is a single-valued variable), and then the corresponding element of the vector y is added to that product, producing the new value of that element of the vector y. So, if I had to view daxpy as a piece of C code, as we did with the vector sum, the vector dot product, and so on, this is what I would see.
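The C version described can be sketched as follows, using the sizes stated in the lecture:

```c
/* daxpy: y = a*x + y in double precision, element by element. */
#define N 2048

double x[N], y[N];

void daxpy(double a)
{
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];   /* per iteration: load x[i], load y[i], store y[i] */
}
```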
Once again, I have the vector x, which is declared as double and of some size; once again I am using the size 2048. The vector y is also declared with the same size, and I have the scalar a, which is not a vector but a single-valued variable. Each time through the for loop, the ith element of x is multiplied by a and added to the ith element of y, and this becomes the new value of the ith element of y; this is how the value of the vector y is computed. This operation I refer to as daxpy. Once again, as always, we need to start by understanding exactly what the reference sequence is. We are going to ignore the variable a, since it could very easily be stored in a register; there is no need, as we step through this loop, to load the value of a from a memory location into a register, because the value of a does not change for the duration of the loop. Similarly, I will assume that the loop index is maintained in a register, so that I need concentrate only on the references to x[i] and y[i]. Note that I need to load x[i] and multiply it by a, then load y[i] and add it to the product, and then store the result into y[i]; therefore, there are three memory operations in each iteration of the daxpy loop.
So, if I think about the reference sequence, I look at the first iteration through the loop. In the first iteration, i is equal to zero; therefore, I need to load x[0], then do the multiplication, then load y[0], then do the addition, and then store the result into y[0]. These are the three memory operations in the first iteration of the for loop: load x[0], load y[0], and store y[0].
Similarly, in the second iteration, when i is equal to one, I will load x[1], load y[1], and store y[1], and so on for all of the 2048 iterations. Now, once again we need to make some assumption about the base addresses of the vector x and the vector y, as we now call them. But we can actually quite easily short-circuit the full table-drawing calculation if at this point we say: let us assume that we know enough about how to adjust the base addresses of x and y so that they do not conflict. If, in finding the base address of x and the base address of y, I found that they end up having the same index, then I could always adjust the base address of y. So, let me assume that I have done that adjusting; now I can look at this sequence of references and try to understand what will happen in terms of hits and misses. Again, we are going to avoid drawing the whole table, since at this point it may be possible to do the analysis quite quickly. The first reference is to load x[0], and I can say that this is guaranteed to be a cold-start miss.
No hope for that one. Then there is load y[0]; I am assuming it does not conflict with x[0], but this too is guaranteed to be a cold-start miss. Then I store y[0]; I can be quite sure that, since a previous reference loaded y[0], y[0] will be in the cache when the store instruction is executed, and therefore I can be quite confident that the store to y[0] will be a cache hit. So this is a new thing which we did not have in our previous examples: a guarantee that there will be a cache hit because the same element was accessed by the previous instruction. Moving on to the next iteration, I can again see that x[0], x[1], x[2], and x[3] are all going to be in the same cache block, and therefore I can be quite confident that load x[1] is going to be a cache hit; similarly, load y[1] is going to be a cache hit, and, as always, store y[1] is going to be a cache hit; and I can proceed with the analysis in this way.
But instead of drawing the table, as I said, let us do the analysis by calculating how many hits I am going to get for the loads of x[i], how many for the loads of y[i], and how many for the stores of y[i]. I know that I am going to have to load x[i] 2048 times; I know that the first one will be a miss but the next three will be hits, and subsequently the attempt to load x[4] will again be a miss. So I know that one out of every four references to x[i] is going to be a miss, but the remaining three out of four, because of spatial locality of reference, will be hits. Therefore, as far as the 2048 loads of the vector elements x[i] are concerned, I can say that there will be 1536 hits.
What I am going to do is calculate the hit ratio in the upper right-hand corner. I then move to my assessment of the loading of the 2048 elements y[i]: once again I know that in the first iteration there will be a cold-start miss, and in the second iteration, because of spatial locality, there will be a hit; similarly, in the third and fourth iterations there will be hits. Just as in the case of the loads of x[i], for the loads of y[i] there will be 1536 hits out of 2048 references.

So, the number of hits that I expect is 1536 for the load references to x and another 1536 for the load references to y. What about the last set of references that I am making, the stores to the elements of y? I know that all of those 2048 stores are going to be hits, so I just add 2048 for the guaranteed hits as far as the store instructions are concerned. How do I calculate the hit ratio? I divide the total number of hits by the total number of memory references. The total number of memory references is given by the number of iterations, which is 2048, multiplied by the number of memory references per iteration, which is three; therefore, the denominator of my hit ratio calculation will be 2048 multiplied by three, which is 6144.
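The bookkeeping just described can be written out directly; a small sketch, under the assumption that x and y do not conflict in the cache:

```c
/* Hit-ratio arithmetic for daxpy, assuming x and y do not conflict:
   1 miss per 4 loads of x, 1 miss per 4 loads of y, all stores hit. */
double daxpy_hit_ratio(int n)            /* n = number of iterations, 2048 here */
{
    int x_hits     = n - n / 4;          /* 1536: 3 of every 4 loads of x hit */
    int y_hits     = n - n / 4;          /* 1536: same for the loads of y     */
    int store_hits = n;                  /* 2048: every store to y[i] hits    */
    return (double)(x_hits + y_hits + store_hits) / (3.0 * n);
}
```

For n = 2048 this gives 5120 hits out of 6144 references, about 83.3 percent.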
So, I now know the hit ratio, and if you do the calculation you will find that, under the assumption that the base addresses of x and y do not conflict in the cache, you come up with the quick assessment that the hit ratio will be 83.3 percent. This is higher than the 75 percent that we got in the previous example because of a slight difference in the nature of the program: here one third of all references, namely the stores (I should be circling the stores), are guaranteed hits by the very nature of the program, because the element being stored was just accessed by the previous instruction, which was a load.
So, daxpy was quite easy for us to analyze, and there was no need to draw the table. This is the direction we would like to move in: what you are observing is that we are moving away from our requirement of actually knowing the base addresses, or actually calculating the index value for each of the elements, now that we have the experience of having worked through two or three simple examples. With this kind of quick analysis, we get a very good idea of what to expect when the program actually executes on a given piece of cache hardware.

So, we can move on to another example quite rapidly. I did want to remind you that we will be encountering daxpy again; that is why I wanted to introduce the term, which we will meet again in the not too distant future. Now, let us move on to a slightly more interesting example, in which we are not dealing with ordinary one-dimensional vectors any more but with two-dimensional matrices, and the operation is quite simple: we have two two-dimensional matrices and we want to add them element by element.
So, the structure of the program is as shown over here: I have two double-precision two-dimensional matrices. Each matrix has 1024 rows, each of which is made up of 1024 columns. The picture that you can have in mind when thinking of either the matrix a or the matrix b is this: rather than the linear entity we pictured for a vector, a two-dimensional matrix is an entity which has many rows. The lines which I have drawn are meant to represent the different rows: the 1024 rows, which are numbered from row 0 up to row 1023, and each row is comprised of 1024 columns, with the columns numbered from 0 up to 1023.

So, this is the picture that one could have in mind when thinking of, let us say, the matrix a or the matrix b, and any particular element of the matrix is referred to using two indices, the row index and the column index. For example, to refer to that element over there, if this is the matrix a, then I refer to it as a[0][0], which means the element in row zero and, specifically, in column zero.
So, that is the element in row zero, column zero of the matrix a, and the notation used in C is exactly the same. I have two such matrices and I want to add them element by element, and in this particular example I am putting the result of the addition back as the new value of the matrix b. In this case we need a doubly nested for loop, and each time through the doubly nested loop, the element of a in the ith row and jth column is added to the corresponding element of b, the element of b in the ith row and jth column; the addition is done and this becomes the new value of that particular element of b. In other words, the ith row, jth column of b takes on this sum. You will notice that the number of times the statement within the loop executes is 1024 multiplied by 1024.
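The matrix-add loop, with the j loop outermost as written on the slide, can be sketched as follows:

```c
/* Element-by-element matrix add with the j (column) loop outermost,
   as in the lecture; the result replaces b. With row-major layout,
   the inner i loop walks down a column: consecutive accesses are
   1024 * 8 = 8192 bytes apart. */
#define M 1024

static double a[M][M], b[M][M];   /* static: 8 MB each, too big for the stack */

void matrix_add(void)
{
    for (int j = 0; j < M; j++)           /* outer loop: columns */
        for (int i = 0; i < M; i++)       /* inner loop: rows    */
            b[i][j] = a[i][j] + b[i][j];
}
```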
Since number of elements in this to the each of these two dimensional matrices is 1024
multiplied by 1024. So, as before we can run through this example, by trying to layout
the order in which the different references occur. So, I try to derive the reference sequence. In looking at this, just bear in mind that we have a situation where there are doubly nested for loops. The terminology which I have used is that I talk about the for loop on the outside, the j loop, as the outer loop, and the i loop as the inner loop. So, initially the outer j loop will cause j to have a value of zero; then we enter the inner loop, which is the i loop in this case, and within the inner loop we start by setting i equal to zero.
And therefore, the first memory operation that happens is loading a of 0, 0; after that we load b of 0, 0, and then we do the addition and we store the result in b of 0, 0. Then we step back and increment i by one, so we go from i equal to zero to i equal to one; j is still zero. And therefore, the next matrix element that we deal with is adding a of 1, 0 to b of 1, 0. So, we have an operation of loading a of 1, 0, loading b of 1, 0, doing the addition, and storing the sum into b of 1, 0. So, as I write out the complete reference sequence, I have the first iteration through this doubly nested loop, which loaded a of 0, 0, loaded b of 0, 0, did the addition and stored the result in b of 0, 0; after that, remember that the value of i changed to one.
This is because the inner loop will be iterated 1024 times (I am sorry, this is 1024 and not 2048, so these numbers should not have been 2047 but 1023). So, in the second iteration through the doubly nested loop I load a of 1, 0 (remember that i has been incremented to one, so it will be a of 1, 0), and then I load b of 1, 0, do the addition and put the result in b of 1, 0, and so on. So, there will be 1024 multiplied by 1024
iterations through this, and there will be a corresponding number of references in the reference sequence. Now, to actually proceed from here, things are a little bit more complicated than what we talked about in the case of the vector examples, because when we talked about the vector examples it was very clear what assumptions we were making: that the neighbour of, for example, a of zero would be a of one, and the neighbours of a of one would be a of zero and a of two.
We assumed that the vector elements would be laid out consecutively in memory. But when we have a two dimensional structure such as what is shown over here, we do have to ask the question on the screen: in what order are the elements of a multidimensional array, such as the two dimensional array which we are currently looking at, stored in memory? And you will quickly realize that there are two possibilities, unlike the case of the vector, where it was clear that there was only one option, consecutive vector elements in consecutive memory locations. Here there are two possibilities: you could either have the neighbour of a of 0, 0 being a of 0, 1 and the neighbour of a of 0, 1 being a of 0, 2; alternatively, the compiler could lay out the matrix elements in memory so that the neighbour of a of 0, 0 is a of 1, 0 and the neighbour of a of 1, 0 is a of 2, 0, and so on. So, there are at least these two possibilities, which is why we will spend a whole slide just giving some terminology to these possibilities.
So, this is a topic which is sometimes known as the storage order of multidimensional arrays. In our particular example we are looking at two dimensional arrays, but something similar would happen for three, four or higher dimensional arrays. The two options which I just talked about are in fact known as row major order and column major order. Row major order was the first which we had looked at: the idea that the elements of the first row of the matrix are stored in consecutive memory locations, and they are then followed by the elements of the second row of the matrix, etcetera, as I had suggested to start off with. Remember, the first row of the matrix a is the elements circled, and therefore the neighbour of a of 0, 0 is a of 0, 1, and the neighbour of a of 0, 1 is a of 0, 2, and so on.
This means that in consecutive locations in memory one would find the different elements of the first row of the matrix, followed by the elements of the second row of the matrix, and so on. In some sense one could talk about the storage order as being row by row: if one looked at how the elements are laid out in memory, first comes the first row, then the second row, then the third row. Remember that the main memory has a linear ordering of its bytes, from address zero up to the size of the memory. So, the idea of row major order is to view the ordering of the matrix elements, when stored in memory, as being row by row, hence the name row major, and this is in fact the convention used by c compilers. The alternative, known as column major order, the one which I spoke of second, would actually store the two dimensional array column by column in memory; in other words, first column zero of the matrix a, followed in memory by the next column of matrix a, and so on.
And this order too is not ignored; this order is used by some programming languages such as Fortran. So, this is the interesting thing to note: which of these two storage orders is used for the multidimensional arrays of your program will depend on what language you have used to write the program. Now, coming back to our example of the two dimensional matrix sum, we have understood the reference order. The reference order was load a of 0, 0,
load b of 0, 0 and then store b of 0, 0, followed by load a of 1, 0, load b of 1, 0, store b of 1, 0, and so on. In other words, consider the order in which the elements of a are accessed. Here I have a diagram showing you the matrix a: first we access a of 0, 0 and next we access a of 1, 0. So, looking at the order in which the elements of matrix a are being accessed, the elements of matrix a are referenced by my program column by column; that is the observation which I would make. Similarly, I can observe for the elements of matrix b: first I access b of 0, 0, then b of 0, 0 once again for the store, but the next element of matrix b which I access is b of 1, 0. So, in the case of b also, we are accessing the elements column by column, first the first column, after that the second column, and so on. Now, you would have noticed that the program that we started with seems to have been programmed in c, and we just learned in the previous slide that, as far as c is concerned, the row major ordering of array elements is used. In other words, if you looked at consecutive memory locations, you would find first the first row of matrix a, and then after that the second row of matrix a, and so on.
What does that mean? It means that the program is accessing the array elements column by column, but the array elements are stored in memory row by row. What does that mean in terms of cache performance? It tells us that, first of all, when a of 0, 0 is referenced, that is going to be a cold start miss. Sometime soon after that, a of 1, 0 is referenced, but we know that the cache block containing a of 0, 0, given that this is a program written in c, will contain the neighbours of a of 0, 0 in the same row. Therefore, the cache block will contain a of 0, 0, a of 0, 1, a of 0, 2 and a of 0, 3, whereas the second reference made by this program is to load a of 1, 0, and therefore this will not be a cache hit due to spatial locality of reference; rather, it will be another cache miss. So, this is a negative: we notice that the program, in the way that we have written it, is not going to benefit from spatial locality of reference; it is basically going to have a of 0, 0 a cold start miss, a of 1, 0 a cold start miss, and so on.
Similarly for b: the load of b of 0, 0 is going to be a cold start miss, and the store to b of 0, 0 will be a hit, because b of 0, 0 was referenced by the previous instruction. Other than the store instructions, the remaining references will not benefit from spatial locality of reference, and this is the negative; it means that our program was not written with knowledge of the cache in mind. So, the bottom line is that our loop will show no spatial locality of reference. The loop is showing temporal locality of reference: notice that all the store references benefit from locality, but they are benefitting from temporal locality, not from spatial locality, and practically our program will not, as far as we can see, show any spatial locality of reference, even if we assume that a and b are not conflicting with each other.
So, in trying to come up with a hit ratio calculation, we can make the assumption that we can use packing or some other technique to make sure that the accesses to a and b do not conflict with each other due to a base address type of problem. But the situation is going to be that in each iteration we will have a miss for the load of a, a miss for the load of b, and then a hit for the store, and that is going to happen for each array element. Therefore, we suspect that we may end up with a hit ratio of about one third, or 33 percent, which is much lower than the kinds of hit ratios we were getting with the previous examples that we looked at; remember, seventy five percent, eighty three percent, and so on. So, this does not seem to be a well written program, from the perspective of comparison with the other programs that we were able to analyze. There was, of course, one issue which I would want to raise. The suggestion is that we have a hit ratio of 33 percent, or 33.3 percent to be more accurate.
But there is an underlying question which may be worth thinking about. We noticed that when a of 0, 0 missed, the block containing a of 0, 0 would be loaded into the cache, and that block would have contained a of 0, 0 along with a of 0, 1, a of 0, 2 and a of 0, 3, which are its three neighbours in the row as far as the two dimensional matrix a is concerned. So, the block will contain those four array elements; unfortunately, the next few iterations of my loop do not refer to a of 0, 1, a of 0, 2 or a of 0, 3, which is why we said that the program does not show good spatial locality of reference. But we do know that some time later this program is going to access a of 0, 1, a of 0, 2 and a of 0, 3. Hence this question: much later in time, when this program executes the loop iteration where it loads a of 0, 1, will a of 0, 1 still be in the cache? Can we calculate whether a of 0, 1 will still be in the cache? If that is the case, then the load of a of 0, 1 will actually be a hit, the load of b of 0, 1 by the same token would be a hit, and so on.
And therefore, the hit ratio may be substantially better than 33.3 percent, and we might actually benefit from spatial locality of reference. So, the question is: how do we analyze whether, later on in time, when our program does in fact try to load a of 0, 1, the block containing a of 0, 1, which was fetched by the first load instruction in the program, will still be in the cache? One way to reason about this rather quickly is just to bear in mind that in our particular program the size of each column is 1024 elements, and therefore we will come to loading a of 0, 1 only after we have finished going through the inner loop once. Therefore, the question has to be analyzed in terms of how much of the cache would have been filled by the time we come back to the outer loop and change the value of j to one. With this calculation we will be able to get an answer to the question about whether a of 0, 1 would be present in the cache or not, and based on that calculation we would have to figure out whether the hit ratio might in fact be higher than what we have calculated over here; I will leave that for you to analyze on the side. So, with this we have an understanding that with multidimensional arrays things may be a little bit more complicated, and we may have to go back to the source code of our c program in order to improve it, in order to make the hit ratio higher.
Now, the simple idea which we are going to try in order to address this is something which is called loop interchange. To see the idea of loop interchange, remember that we started off with a two dimensional matrix sum; the program looked like this. And you will remember that the outer loop was the j loop, the inner loop in what we had earlier was the i loop, and inside the loop we were accessing a of i, j and b of i, j.
Now, what if we realize that the problem with this loop was that we were going through the two dimensional matrices column by column, and that we could correct for that by just going through the two dimensional matrices row by row? There is nothing in the definition of two dimensional matrix sum which says that we have to go through the matrices column by column. We could just as well have gone through them row by row, and one way to achieve that is actually to interchange the j loop with the i loop. In other words, instead of having the j loop as the outer loop and the i loop as the inner loop, I could modify the program so that the i loop is the outer loop and the j loop is the inner loop, which is what I have done.
Now, notice that the outer loop is now the loop in which I vary the value of i, i being the first index into the matrix a and the matrix b, and the inner loop is now the j loop. This means that the first time through the outer loop, i will be equal to zero; then you go to the inner loop and j will be equal to zero, and therefore you operate on a of 0, 0 and b of 0, 0. After that you go to the second iteration of the inner loop, where the value of j is changed to one, and you operate on a of 0, 1 and b of 0, 1; notice that this is now stepping through the two matrices row by row. Therefore, bearing in mind that this was a c program, and that in a c program the two dimensional matrices would be stored row by row, by just correctly interchanging the two loops, the j loop and the i loop, along the lines of what we see here, we would be causing the program to have a reference sequence which is the same as the storage order for the array elements.
So, we now have the modified program, and all that I have done is to interchange the "for i" line with the "for j" line; that is the only change that was made. The reference sequence will now be load a of 0, 0, load b of 0, 0, store b of 0, 0, followed by load a of 0, 1, load b of 0, 1, store b of 0, 1. And now we notice that the program will benefit from spatial locality of reference: the load of a of 0, 0 will be a cold start miss, the load of b of 0, 0 will be a cold start miss, the store to b of 0, 0 will be a hit, and the load of a of 0, 1 will be a hit because of spatial locality of reference. Similarly, the load of b of 0, 1, and likewise the loads of a of 0, 2 and a of 0, 3, will be hits, and therefore we get a substantially improved hit ratio, in fact 83.3 percent, which was the best kind of hit ratio that we have seen from our previous examples.
So, clearly, in order to deal with programs that have multidimensional arrays, such as the two dimensional arrays that we are looking at right now, it is important to have the order in which the array elements are accessed carefully thought out. Otherwise you will have a situation where the reference sequence differs from the storage order of the array elements, and therefore spatial locality will not become available and the program will not benefit from the spatial locality properties of the underlying cache hardware.
Now, this idea of loop interchange is interesting, the idea that one could interchange the i loop and the j loop, but this is similar to something that we have seen earlier. You will recall that when we were talking about instruction scheduling, I interchanged instructions at the machine language level. And at that point I had made special mention that it may not always be safe to just arbitrarily interchange two instructions, since the meaning of the program could change.
The net effect of executing the modified program could be different from the net effect of executing the pre-existing program, and therefore we do have to ask the question of whether one can just interchange the i loop and the j loop without changing the meaning of the program. Now, for this particular example, the two dimensional matrix sum, you would quite easily be able to satisfy yourself that there is no harm in interchanging the i loop and the j loop, in that each time through this loop we are just dealing with the array element a of i, j and the corresponding element of b, adding them together and modifying the matrix b; therefore, whether I go through the matrices row by row or column by column, the net effect on the result is going to be the same.
In other words, the final value of the matrix b is the same, and the only difference is that the version in which I have interchanged the loops will run faster, because of the improved cache behavior. But there could well be loops in which it is not safe to do this kind of an interchange, and maybe we have to understand loop interchange a little bit more carefully in order to figure out whether loop interchange is a generally applicable technique, or whether we have to apply it with a little bit more caution. I would ask you to satisfy yourself that for this particular loop it is perfectly safe to do the loop interchange; there is absolutely no danger of the program doing something different. Now, in the next lecture we will actually start by moving to an example where it is not safe to interchange the loops.
So, we will try to analyze why it is not safe, before proceeding to some other examples in this sequence. In this class, you will recall that we have looked at some useful additions to our repertoire of techniques to improve cache performance. We had seen the idea of packing, or changing the base address of a matrix or a vector; in the previous lecture we added to that the idea of actually merging two vectors into a single vector, to get the benefit of spatial locality across the two array references. And today we were looking at multidimensional matrices, and realized that it may be important to make sure that the storage order, in terms of how the compiler assigns addresses to the elements of the multidimensional array, is well aligned with the order in which the program references the array elements; otherwise, once again, the program may not benefit from spatial locality of reference. And we stop here for today. Thank you.