Welcome to lecture twenty-nine of our course on high performance computing. We are currently
looking at the operation of cache memories, trying to understand how they affect the performance
of our programs. Towards the end of the previous lecture, I had given you a rough idea of one
of the determinants of computer development, and I had shown it to you in the form of a graph,
which is often described as the Moore's law graph. What this graph actually plotted, with the
linear progression of time in years on the x-axis, was how the performance of computer systems,
or of the components of computer systems, was improving due to technology developments. The
important thing about the Moore's law graph was that it showed, on a log scale on the y-axis
against a linear scale on the x-axis, that the speed with which processor circuits were becoming
faster was much greater than the speed at which memory circuits were becoming faster. In other
words, the speed disparity between processors and memories was growing faster and faster under
the Moore's law prediction, and the consequence of this was that cache memory hierarchies ended
up becoming a little complicated. But before actually coming to that, I did want to comment: if
you look at the Moore's law graph, which was one of the slides in the previous lecture,
then you would have noticed that the graph was plotted for a period of time from about
1980 until about 2000, in the graphs that I used. Originally, Gordon Moore had plotted the
graph for data from much earlier points in time, the 1970s and into the 1980s, and subsequently,
people found that this trend continued. Occasionally, the slope of the performance improvement
line has been steeper or shallower, depending on how technology trends moved.
But one additional fact should be borne in mind, to understand why the consequences of the
Moore's law graph may not be as pronounced today: when the performance of a computer system
goes up, it is possible that the amount of power, the amount of energy that is consumed for
the execution of a program, may also have to go up. By power, you might think of the amount
of heat that is generated when a program runs on the computer, and it turns out that power
and heat could serve as alternative labels for the y-axis in the Moore's law graph. In other
words, as one looks along time at the amount of heat generated by a processor, one finds a
similar exponential behavior to what was plotted for performance.
As a consequence, as time went on, it became very clear that computer architects could not
continue designing processors which became more and more complicated and more and more powerful,
because at the same time the amount of heat, the amount of power, would be growing at a very
fast rate; it would become very difficult to cool these processors down, or even to provide
the energy for these computers to keep on executing. This is why the technology trends did not
proceed in this direction for very long, and some tradeoffs had to be made between improvements
in performance and reduction in power and so on.
But in any event, the bottom line from our quick understanding of the Moore's law graph
is that, if we think of a memory hierarchy in which there was a cache, main memory and
secondary memory, then, with time, as the speed disparity between processors and main
memories became worse, there was a need to improve the nature of the caches: to have small
fast caches in addition to larger slower caches, just to close the speed gap between the
CPU and the main memory. Remember that the CPU operates at speeds which are comparable to
those of the very highest parts of the memory hierarchy, such as the registers.
Now, with those additional comments to my quick discussion of Moore's law from the previous
class, let us move on to an important topic in connection with cache memory, and that is:
how does our knowledge of cache memory change our perspective on the programs that we write?
I have titled this set of slides "caches and programming", since we now understand enough
about cache memories and the way that they operate to be able to assess the performance of
our programs when they run on particular kinds of cache hardware.
Our objective in doing this is, for the programs that we write, or for the parts of programs
that are important to us, to learn techniques with which we can assess cache-related performance
issues; in other words, to get some kind of understanding of how a particular part of one of
our programs would execute on, or benefit from, a particular kind of cache. That is our
objective: to put our knowledge of caches together with our understanding of programs, and
get some benefit on the programming side. The way I am going to do this is by looking at a number of
examples of simple programs. Now, to keep the discussion somewhat simple,
but yet realistic, we will not take into account the behavior of the instructions of our
programs. So, when I talk about a program or an important part of a program, we will
concentrate primarily on the interaction between the program and memory as far as its data
accesses are concerned. We could assume, for example, that we are running the program on a computer system
which has separate instruction and data caches, and that we will assume that the instruction
cache behavior is not of interest to us right now.
We are concentrating only on the data cache; that is essentially what we will do in most of
the examples that we look at. So, we will consider only the data cache in our examples, and
a typical data cache configuration that I will use in these examples may be something like
this. I will use some variations on this kind of configuration, but I just want to remind
you of what we have learned about the four questions, the 4 Qs, of cache organization.
So, in this particular quick description of a cache, I mention that the size of the cache
is 16 kilobytes. I mention that the size of each block in the cache is 32 bytes. I also
mention that the cache is direct mapped, from which you understand that direct mapped block
placement is used, which means that for any main memory block there is a unique cache
block into which that main memory block might be copied.
In addition, I am informing you that the cache uses the write-back policy, which means that
on a write hit (a write access to the cache which is a hit), the main memory is not updated
at that point in time; the update is done only to the cache copy. And
subsequently, if that particular cache block is replaced, the main memory copy would be
updated. So, these were some of the key words from our understanding of cache organization,
in the framework of the 4 Qs of cache organization, and I will give you similar brief
descriptions of caches in order to assess the behavior of particular programs. Okay.
Now, for this particular cache organization, remember that
we were able to use the information in this description to understand how an address generated
by the processor would be viewed by the cache hardware, and just to quickly remind you,
we had to decide how many bits were going to be used for the block offset, how many
bits were going to be used for the cache index and we knew that the remaining bits would
be used for the tag. For this particular example, given that the size of a block is 32 bytes,
we immediately know that the offset is going to require 5 bits, because log base 2 of 32
equals 5. Next, since the cache is 16 kilobytes with 32-byte blocks, the number of cache
blocks is 16 K divided by 32, which is 512, and log base 2 of 512 gives the 9 bits that are
needed for the index.
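To make this concrete, a small sketch in C of this derivation might look like the following; it is not from the lecture slides, and assumes the cache and block sizes just described together with 32-bit addresses.

    #include <stdio.h>

    /* Derive offset/index/tag widths for a 16 KB direct mapped cache
       with 32-byte blocks, assuming 32-bit addresses. */
    static int log2i(unsigned n) { int b = 0; while (n >>= 1) b++; return b; }

    int main(void)
    {
        unsigned cache_size = 16 * 1024;                    /* 16 KB */
        unsigned block_size = 32;                           /* bytes */
        int offset_bits = log2i(block_size);                /* log2(32)  = 5 */
        int index_bits  = log2i(cache_size / block_size);   /* log2(512) = 9 */
        int tag_bits    = 32 - offset_bits - index_bits;    /* 18 */
        printf("offset=%d index=%d tag=%d\n", offset_bits, index_bits, tag_bits);
        return 0;
    }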
So, we were able to calculate the number of bits for the index and the number of bits for
the offset, and we assume that the remaining bits (18, in a 32-bit address) are used for
the tag. This is the form in which this information was used in our description of how the
cache organization operates. Now, we are going to start with a very simple example
of a code fragment. The particular example I will start with, I describe as "vector sum
reduction", and from the description you will be able to understand that this seems to be
an operation or a program which is doing something on a vector of data: a sum reduction on
the vector. The word reduction suggests some reduction or lessening in the amount of data,
and the word sum suggests that the way the reduction is done is by adding the values of all
the elements within the vector, and that is exactly what this simple program does, as shown
by this C implementation. So, in this example, I have a vector which
is implemented as a floating point array A. The size of the vector is 2048 which means
that the array elements will be referred to as A of 0 up to A of 2047 and in addition,
I have indicated that each of the array elements is of type double. Remember that a double
is a floating point value which is not 32 bits in size, but 64 bits, which means that the
size of each array element is 8 bytes. In this particular code segment, what is being done
is: for i equal to 0; i less than 2048; i plus plus, the elements of the vector A are added
to a running sum which has been initialized to 0, so that, at the end of the execution of
this for loop, the variable sum will contain the sum of all the vector elements, which is
why it is called a sum reduction.
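The slide code itself is not reproduced in this transcript; a minimal sketch of what such a vector sum reduction might look like in C, under the declarations just described, is:

    /* Vector sum reduction: N = 2048 doubles, 8 bytes each. */
    #define N 2048
    double A[N];

    double vector_sum_reduction(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += A[i];      /* one load of A[i] per iteration */
        return sum;
    }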
At the end of this small piece of code, the variable sum contains the sum of all the
vector elements, A of 0 through A of 2047. So, that is a nice simple code example. It
is of course not a complete program, but if we can understand this particular program
from the perspective of our cache behavior, and if this is a very important part of a
program of ours, then we probably have a good understanding of the overall behavior of the
program itself, a larger program. Now, recall that I indicated that we are not going to
concern ourselves with the instructions corresponding to this program; we are going to
concentrate on the data accesses it generates. But in order to do the analysis, it will
probably be easier for us if we can view the program not in this C form but in a machine
code form, so that we can actually see the individual loads
and stores. In other words, we can identify what the different
memory accesses are. Remember, we are going to have to view this program in terms of the
loads and stores that it sends to the cache memory, and exactly what address is associated
with each of the loads and the stores. Only then will we be able to talk about how the
cache memory will perform for a particular program.
So, we need to move down from understanding vector sum reduction in C to understanding
the vector sum reduction in terms of the load and store instructions, how many load and
store instructions are executed and what are the addresses associated with the operands
of the load and store instructions. Okay. Now, if you remember from our discussion of
simple code examples with loops, the loop index was typically kept in a register, and
therefore the loop index i may not be a variable that we have to worry about as far as
cache behavior is concerned. So, we will assume that the compiler generates machine
language code for this particular loop in such a way that the loop index i is handled
inside a register, rather than being accessed from memory each time the variable i is
referred to (and i is referred to several times in each loop iteration). If I make the
assumption that the loop index is accessed out of a register, then I can ignore the loop
index i completely in my analysis of this program.
In addition, I could assume that the sum variable is also handled through a register, and
it should be fairly easy to see that a compiler could quite easily do that; or, if the
compiler did not do it, I could modify the program so that it did, in order to reduce the
number of memory accesses that this program makes. So, with these two assumptions, our
perspective on vector sum reduction gets simplified substantially. We no longer worry
about the accesses to "sum" or to "i" in this program, and what we are left with is just
the references to the array A, the vector A. In short, that is how we will view this
particular program: not in terms of
its instructions, because we have already decided that we will be worried only about
the data cache behavior of the program. Further, we will completely ignore the loop
indices, and we will ignore scalar variables that can live in registers (in this case,
only one variable, "sum"); we will be primarily worried about the accesses to the vector A.
Now, if we are primarily worried about the accesses to the vector A, then viewing this
program in terms of machine instructions is very simple. I do not even have to think about
how this program gets compiled. I understand very clearly that, the first time through the
for loop, there will be a need to load from A of 0 into a register, because subsequently
the contents of that register will be added to the variable sum, which is also in a register.
Therefore, each time through this loop, the loop corresponding to vector sum reduction,
there will be a load instruction which loads an element of the vector into a register.
So, in short, I can think of the vector sum reduction as being a series of load instructions.
The first time through the loop, it is a load of A of 0; the second time through the loop,
it is a load of A of 1; and so on. So, I view the vector sum reduction as a sequence of
2048 loads, from a load of A of 0 through a load of A of 2047. Remember that the size of
my vector was 2048, hence the sequence from the load of A of 0 to the load of A of 2047.
Now, in order to think about the impact of this program (and remember, this is now the
program that I am worrying about; this is what I have distilled out of the vector sum
reduction loop, the essence that we are going to concentrate on), in order to proceed to
understand the
impact on the cache, I have to be able to calculate the address of each of these vector
elements. It is not enough to talk about A of 0, I have to talk about a specific address.
Therefore, I will make an assumption about what the address of A of 0 is. In this particular
example, I am assuming that the address of A of 0 is hexadecimal A000. Let me remind you
that whenever you see a number with 0x in front of it, it is a hexadecimal number. We have
not used hexadecimal for some time now, but this is telling us that the address of concern
is being given in hex, that is, base 16; the hex digit A is binary 1010, and it is followed
by three hex zeros. If I wanted to view this as a 32-bit address, then I would put additional
zeros in front to make it into the required 32-bit address. Now, if I am assuming that the address of
A of 0 is hex A000, what can I assume about the address of A of 1? Remember, it is the
compiler that lays out the array inside the virtual address space of the process that will
come into existence when this program executes; the compiler decides the order in which
the different elements are placed. I will assume that the compiler, given that the address
of A of 0 is hex A000, will put A of 1 into the next memory location (hex A008, since each
element is 8 bytes), A of 2 into the memory location following that, and so on. In other
words, it will put the elements of the vector into contiguous, neighboring memory locations.
That seems like a reasonable assumption. Then I can now view each of these memory addresses
from the perspective of how the cache will look at that particular memory address. And
let me just go back to our understanding of the cache.
In this particular example, we are going to assume that the cache is direct mapped,
16 kilobytes, write-back, with a 32-byte block size, which means that, in looking at an
address, we have to view the address in this way: the 5 least significant bits of the
address are the offset, followed by 9 bits of index, and then 18 bits of tag, if we are
talking about a 32-bit address.
Now, when I think about the address hex A000, I can view it in this light, I can look at
the least significant 5 bits, I can look at the next 9 bits and so on.
What happens when I do that? Remember, the least significant 5 bits are the offset, the
next 9 bits are the index, and the remaining 18 bits are the tag. So, if I remove the
least significant 5 bits as the offset, and then count off the next 9 bits (1 2 3 4 5 6
7 8 9), what I come up with is the bit value 1 followed by 8 zeros; instead of referring
to it as binary 100000000, I will refer to it as decimal 256.
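To make this concrete, a small sketch of the same bit extraction in C, under the assumed address of hex A000, might look like this; the shift and mask values follow from the 5 offset bits and 9 index bits.

    #include <stdio.h>

    int main(void)
    {
        unsigned addr   = 0xA000;               /* assumed address of A[0] */
        unsigned offset = addr & 0x1F;          /* least significant 5 bits */
        unsigned index  = (addr >> 5) & 0x1FF;  /* next 9 bits */
        unsigned tag    = addr >> 14;           /* remaining 18 bits */
        printf("offset=%u index=%u tag=%u\n", offset, index, tag);
        /* prints: offset=0 index=256 tag=2 */
        return 0;
    }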
You can evaluate this by expanding the powers of 2: a 1 in the ninth bit position is 2
power 8, which is decimal 256. So, in short, for the particular cache organization that
we are using, a 16 kilobyte cache with a 32-byte block size under direct mapping, the
array element A of 0 will have its index bits equal to 100000000, which is decimal 256.
What about
array element A of 1? Now, in order to find out the address of array element A of 1, I
have to know how much space A of 0 occupies.
But we know that the vector sum reduction loop is dealing with double precision floating
point values, from which we understand that the size of each array element is 8 bytes.
Given that the block size is 32 bytes, this tells me that in any particular cache block,
4 consecutive array elements will be present.
So, in any one of the cache blocks, let us suppose this is one of the cache blocks of
size 32 bytes, we actually have enough space to store four 8 byte doubles. Therefore, if
I have A of 0, I could also have A of 1, A of 2, and A of 3, all within the same cache
block. That is what this simple observation allows
me to learn. In other words, if A of 0 has address A000 and A of 1 has address A008,
which is 8 bytes after the address of A of 0 and so on, and I do the same calculation
using its least significant bits, I will find out that A of 0, A of 1, A of 2, and A of
3 are all going to be in the same cache block and will all have that index 256.
In other words, once A of 0 is in the cache, I will know for sure that A of 1, A of 2,
and A of 3 are also in the cache, because they all come from the same cache block; they
differ only in their offset bits, if you look at the addresses of A of 0 through A of 3.
In fact, that is what I have over here: I am showing you the addresses of A of 0, A of 1,
A of 2, and A of 3. Looking only at the index bits and the bits following, notice that A
of 0 is a 1 followed by all zeros. A of 1 is going to be 8 bytes later, which means that
it will have the address A008. A of 2 is going to be 8 bytes after that, address A010,
whose last bits are 10000, and so on; and you will notice that they all have the same
index bits. So, with this observation, we can look forward to the array elements A of 4
through A of 7 and do the same calculation.
And we would find that, given the base address of A of 0, we could calculate the base
address of A of 4, and that the base address of A of 4 is such that it will have an index
of 257 within the cache that we were talking about; and A of 4 through A
of 7 would all have that same property. In other words, the first 4 array elements all
have cache index 256, the next 4 array elements all have cache index 257, and so on, and
this is entirely because the 4 array elements A of 0 through A of 3 all occupy one cache
block, given that the size of a cache block is 32 bytes. Similarly, array elements A of 4
through A of 7 all occupy one cache block, and this reasoning can be used to argue all the
way through the complete program, in other words from the loading of A of 0 all the way
through the loading of A of 2047.
But to do this systematically, we will draw a table in which we will try to calculate the
number of cache misses and the number of cache hits that would result for the vector sum
reduction program. So, I am going to draw a table of this kind. In the leftmost column of
this table, what I am showing you is the program itself; the only instructions in this
program that are interesting to us right now are the load instructions. The program first
loads A of 0, then it loads A of 1, then it loads A of 2, and keeps on going until it
loads A of 2047.
Now, we know the base address of the array, which is the address of A of 0; we assumed it
was hex A000. Given that the size of each array element is 8 bytes, we calculated the
address of A of 1, the address of A of 2, and so on. So, we were able to calculate the
addresses of each of the array elements using the base address assumption and our knowledge
of the size of each array element. The size of each array element is decided by the type
declaration used in the program; in our particular program, each array element is of type
double, which means that it occupies 8 bytes, which is why there is an 8-byte difference
between consecutive addresses. Then, knowing what the cache hardware does with an address,
for the particular cache that we are concerned with, I can work out what the index bits
for each of these addresses are. As we had said, the first 4 array elements have index 256,
the next 4 array elements have index 257, and if we keep on doing this, we will find that
the last 4 array elements have index 255. Now, the next question is: how well is the
program going to do on hardware which has a cache of the kind that we had described?
In other words, how many of these accesses are going to be hits and how many of these
accesses are going to be misses within the cache? Now, the way to think about this is,
we could start by assuming that, when the program starts executing, none of the array
elements will be present inside the cache, and therefore, when the program executes
its first load instruction, load A of 0, it is going to encounter a miss. Nothing would
have come into the cache since nothing had been requested by the program before, as far
as the address space of this process is concerned and therefore, we would guess that the first
access is going to be a miss. However, as a result of that miss on A of
0, there is going to be a fetching of a block from the main memory into the cache and that
block is a 32 byte block which contains not only A of 0, but also A of 1, A of 2, and
A of 3. Hence, when the second load instruction is executed to load A of 1, we would guess
that the load of A of 1 is going to benefit from a cache hit, as will the load of A of 2,
as will the load of A of 3, and we can continue this kind of thinking for all
the addresses in this table and what we will end up with is something like this.
Let me just get rid of this unnecessary notation here. So, what we see here is something
like this. Once again, this is the same table we had before (let me just go back and make
sure; yes, it is exactly the same table). This is the column with the array elements; this
is the column with their addresses, based on the assumption that A of 0 is at address A000.
These were the cache indices which we calculated from that, and this was the reasoning
that we came up with.
The first access to array element A of 0 would be a miss because the program, initially,
when it started executing, would not have had anything fetched into the cache by the
hardware.
But once A of 0 was a miss, the cache block which contains A of 0 will also contain A
of 1, A of 2, and A of 3 and hence the accesses to load A of 1, to load A of 2, and to load
A of 3 would all be hits. After this, when the time comes to fetch A of 4, it has a
different cache index, and the cache hardware would find that there is a miss; but the
subsequent three accesses, to A of 5, A of 6, and A of 7, would be hits, and so on.
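You can check this reasoning with a small simulation. The sketch below models just the tag and index bookkeeping of the 16 KB direct mapped cache we have been assuming, feeds it the 2048 load addresses, and reports the hit ratio; the structure and names are illustrative, not the lecturer's code.

    #include <stdio.h>

    #define NBLOCKS     512     /* 16 KB / 32 B */
    #define OFFSET_BITS 5       /* log2(32) */
    #define INDEX_BITS  9       /* log2(512) */

    static unsigned tag_store[NBLOCKS];
    static int      valid[NBLOCKS];

    /* Returns 1 on a hit, 0 on a miss (installing the block on a miss). */
    static int cache_access(unsigned addr)
    {
        unsigned index = (addr >> OFFSET_BITS) & (NBLOCKS - 1);
        unsigned tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        if (valid[index] && tag_store[index] == tag)
            return 1;
        valid[index]     = 1;   /* cold start: nothing is valid initially */
        tag_store[index] = tag;
        return 0;
    }

    int main(void)
    {
        unsigned base = 0xA000;                 /* assumed address of A[0] */
        int hits = 0;
        for (int i = 0; i < 2048; i++)
            hits += cache_access(base + 8 * i); /* load A[i], 8-byte doubles */
        printf("hit ratio = %.2f\n", hits / 2048.0);  /* prints 0.75 */
        return 0;
    }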
Now, in terms of terminology, people often talk about that first miss, which happens
because your program has not previously accessed the particular cache block: it is often
referred to as a cold start miss. Since we assume that the cache is initially empty when
the program starts executing, this miss is compulsory; there is no way to avoid it, and
it arises because the cache started off empty. So, this particular kind of miss is
described as a cold start miss; it comes about because the program started executing in
a situation where the cache did not contain any of its data. Subsequently, we have the
hits. Now, we can go ahead and calculate the hit ratio for this particular program, for
the particular cache that we are talking about, and we find out that the hit ratio is in fact,
75 percent. How do we come up with this? We know that there is a total of 2048 accesses,
of which every fourth access is a miss and the remaining three-fourths of the accesses
are hits, which tells us that there will be a hit ratio of 75 percent. So, 75 percent of
the accesses which this program makes to memory will be satisfied directly by the cache.
More specifically, out of the 2048 memory accesses that this program makes, 1536 will be
hits, and if we think about it a little bit, all of these 1536 hits are happening because
of spatial locality of reference. Why do I say that? Just think about the 3 hits that we
had over here. Why was the reference to A of 1 a hit?
It was a hit because the cache hardware is designed in terms of blocks, in which any one
block contains 4 neighboring elements of the vector A; spatial locality is automatically
being exploited. I get a miss on A of 0, and the hardware has to get the block from memory
into the cache, but the benefit one gets is that the neighbors of A of 0 are immediately
available from the cache, and that is what we talked of as spatial locality of reference.
This particular program is showing very good spatial locality of reference, in that, after
A of 0 was referenced, its three neighbors A of 1, A of 2, and A of 3 were in fact
referenced in the immediate future; and if we had a larger block size, we would have got
even more benefit out of the spatial locality of reference. This particular cache has a
block size of only 32 bytes, which is why we got the spatial locality benefit for only
the next 3 elements. Now, there is another interesting observation we could make;
let me just read it out to you. Let us imagine that we had written our program in the
following way: if the loop that we had just seen was preceded by a loop that accessed all
the array elements, then the hit ratio of our loop could be 100 percent.
Let us just think about this a little bit. The idea I am working with here is: we had this
vector sum reduction loop, and we now understand that, for the cache we are concerned
with, the hit ratio of this particular loop was 75 percent, because A of 0 suffered a miss
but A of 1, 2 and 3 were hits. The current suggestion is that, knowing this, I could have
included another loop before it, in which all that I do is, for each of the array elements,
access A of i; it could be something like s equals A of i, just an access to A of i, as in
the sketch below.
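A sketch of the suggested structure, under the same assumptions; the volatile qualifier is just to discourage a real compiler from deleting the apparently useless first loop.

    #define N 2048
    double A[N];

    double warmed_sum(void)
    {
        volatile double s;
        for (int i = 0; i < N; i++)
            s = A[i];          /* warm-up: pull each block of A into the cache */

        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += A[i];       /* now every access should be a cache hit */
        return sum;
    }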
What I am doing is preceding the loop that I am interested in by another loop, and all
that this other loop does is access each of the array elements. What is the purpose of
that loop? The purpose of that loop is to cause the contents of the entire array to be
present in the cache. So, when the loop that I have just written executes, it is going to
cause A of 0 and A of 1 and A of 2 and A of 3 (and so on, through A of 2047) to come into
the cache.
As a consequence, when the loop that I am interested in starts executing, the entire
vector A will be present in the cache and therefore, rather than executing at a hit
ratio of 75 percent, the loop that I am interested in will execute at a hit ratio of 100 percent.
It will get 100 percent cache hits. Every single access to the array during this loop
will be a hit. What is the price that I am paying?
The price that I am paying is, I am executing additional instructions, I am executing an
additional 2048 load instructions, and of those 2048 load instructions, how many of
them will be hits and how many of them will be misses?
If I execute those 2048 loads, then I will get a 75 percent hit ratio on that preceding
loop, but the overall hit ratio of the entire program will be more than what I had
originally, in other words more than 75 percent, and that is therefore an interesting
idea. Let me just put down
what we have over here. The observation is, that if I had preceded the loop that I am
concerned about by a loop which essentially just accessed all the array elements, thereby
causing them to be present in the cache when my vector sum reduction loop executed, then
my vector sum reduction loop would execute with 100 percent hit ratio and in this particular
case, I could not say that all of the hits are because of spatial locality of reference.
I would have to say that 25 percent of the hits are due to temporal locality of reference.
Why temporal locality of reference? Because the previous loop which I had executed had
accessed A of 0. Therefore, a little later on in time, when my vector sum reduction loop
executed and accessed A of 0, because of temporal locality of reference, it found A of 0 within
the cache and therefore, it benefited from a hit. Therefore the 100 percent hit ratio
that my vector sum reduction loop benefits from, will be 25 percent due to temporal locality,
and only 75 percent due to spatial locality. Of course, the overall behavior of the program
is not improved, because I have caused more instructions to be executed and therefore the
program will take more time. In fact, the program will suffer as many misses, in other
words as many main memory accesses, as it did without the preceding loop. Therefore, this
is just for the purpose of understanding that it is possible to exploit both temporal and
spatial locality of reference, that your program could be doing one or the other, and that
what is actually important is to understand how your program is behaving. Now, let us
understand a little bit more about what is happening by looking at a small variant of
example one.
Now, you will recall that in example one, we did vector sum reduction on a vector, a
double vector of size 2048. Now, I am going to modify that.
I am going to consider vector sum reduction with a vector of size 4096, a vector which is
double the size of the previous vector. Now, what will happen with the same cache that we
talked about in the previous example? First of all, why should it make a difference at all?
To understand why it makes a difference, think about the previous example, where the size
of the vector was 2048.
What was the actual size of the vector? The vector contains 2048 elements, and the
size of each element was 8 bytes which means that the total size of the vector A was 2048
multiplied by 8 and if you do this calculation, you will see that it is of size 16 kilobytes,
and what is important about 16 kilobytes? 16 kilobytes is the size of the cache that
we are currently dealing with. In other words, when I had vector sum reduction on an array
of size 2048 elements, the entire array could fit into the cache. On the other hand, if
I deal with vector sum reduction on an array of 4096 double precision floats, then the
size of the vector is going to be twice the size of the cache. In other words, this vector
will not fit into the cache completely; the vector is larger than the cache.
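Writing the footprint arithmetic out under our assumptions:

    2048 elements x 8 bytes = 16384 bytes = 16 kilobytes, exactly the size of the cache;
    4096 elements x 8 bytes = 32768 bytes = 32 kilobytes, twice the size of the cache.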
Therefore, if I considered preceding my vector sum reduction loop by a loop which just
accessed each of the array elements, and the array was of size 4096, then, since the
entire array no longer fits into the cache, there is no benefit to be had in the case of
the 4096 example; I will still suffer the misses. In other words, just preceding the loop
of interest to me, the vector sum reduction loop, by a loop which accesses each of the
4096 elements cannot help at all. It will just make the program slower. I will still get
only a 75 percent hit ratio, and this is entirely because the vector itself does not fit
into the cache.
So, after the execution of the preceding loop, only the second half of the array will be
in the cache, and our loop sees misses just as we saw. So, there will be no benefit from
preceding the loop by a loop which accesses each array element, as there was in the case
of the smaller vector. In the case of the smaller vector, by having that small loop before
the vector sum reduction loop, I could artificially give the vector sum reduction loop a
hit ratio of 100 percent; but that is not going to happen if I have a larger vector, in
which case I will get only the hit ratio of 75 percent, even for the vector sum reduction
loop. Now, the kinds of misses that cannot be avoided
in this case are what are called capacity misses. We could not avoid the capacity misses
for the larger vector because of the capacity problem: the vector is bigger than the
capacity of the cache, and there is no avoiding those misses. For the smaller vector,
since the vector was smaller than the cache, I could avoid the misses, because the cache
was capacious enough, large enough, to accommodate the entire vector, and I could therefore
preload the vector into the cache, in some sense, by using a preceding loop which accesses
each of the vector elements.
So, we can now move on to another example, which is a little bit different from the
previous example we just saw, though we are going to proceed along very similar lines.
Once again, we are going to look at a C version of an operation and understand what the
operation is. We will ignore all the instructions; we will ignore all the variables that
we can, like loop indices and so on; and we will concentrate on the critical data elements
accessed by the piece of code. Now, the second example we are going to look at is something
called vector dot product. A dot product is the element-by-element multiplication of two
vectors, with the products added together, resulting in a single scalar value. So, in this
particular example, in the C code, I have a dot product of a vector A of size 2048 with a
vector B, also of size 2048, and the dot product is going to end up in a variable called sum.
So, what the dot product does is, it multiplies A of 0 by B of 0, multiplies A of 1 by B of
1, multiplies A of 2 by B of 2 and so on and adds all of these products together into the
running sum constituted by the variable sum.
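Again, the slide code is not reproduced in this transcript; a minimal C sketch of the dot product loop being described might look like this.

    /* Vector dot product: two vectors of 2048 doubles each. */
    #define N 2048
    double A[N], B[N];

    double dot_product(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += A[i] * B[i];   /* two loads per iteration: A[i] and B[i] */
        return sum;
    }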
We are going to use the same cache size and the same cache organization that we used in
the vector sum reduction example. So, once again, we have to reduce this C program into a
version whose cache behavior we can analyze; in other words, we have to reduce it into a sequence
of loads and stores, and then we have to make some assumptions about the addresses of the
different vectors, the base address of A and the base address of B, and then we can
actually calculate the hit ratio for this particular program. Instead of running through
all the reasoning from the previous example, we can shortcut immediately to a description
of the references that will result from this program. Once again, we will ignore any
the references that will result from this program are. Once again, we will ignore any
reference to the variable i because, the variable i can be is stored in a register and therefore,
need not show itself to the cache memory, need not result in a load or store instruction.
So, any reference to i can be ignored right then. i need not be loaded. Similarly, I will
assume that the sum is maintained inside a register and therefore, all that we are left
with is the loading A of 0, loading B of 0, then second iteration- loading a of one, loading
b of one and so on. So, therefore, we have reduced the vector dot product program to
the following reference sequence. So, these are the memory references that the program
would make, under the assumption that we have had to make initially. The first reference
is to load A of 0, then there is an instruction to load B of 0.
Then the next iteration of the loop- load A of 1, load B of one and so on. Therefore,
it will end with load B, load A of 2047, load B of 2047. Now just, as in the case of my
vector sum reduction, I need to make some assumption about the base the addresses of
the vectors A and B and therefore, I will just assume that the starting address of the
vector A is A000 in hex and I will assume that the starting address of vector B is E000,
once again in hex. Once you will observe that, we are still dealing with double precision
array elements and therefore, just as in the case of A vector sum reduction, each cache
block will be large, is large enough to hold for consecutive array elements.
In other words, A of 0, A of 1, A of 2 and A of 3 will all be stored in one cache block.
Similarly, for B: B of 0, B of 1, B of 2, and B of 3 will all fit into one cache block,
and subsequently B of 4, B of 5, B of 6, and B of 7 will be in the next cache block.
So, we can run through the same mechanism that we used in the vector sum reduction, and
we would end up with a table that looks something like this. Let me just run through the
columns and rows of this table so that we understand what is happening. In the first
column, we have the program itself. Remember, the program loads A of 0, then it loads B
of 0, then it loads A of 1, then it loads B of 1, and so on, ending with the last array
elements. (I will just bring to your attention that the arrays are of size 2048; in this
particular table, I have gone only up to B of 1023, whereas it should have gone up to B
of 2047, whose address would be different. There is a small mistake on the slide, but it
does not affect the reasoning.) Now, in the next column, we have the addresses.
So, the base address of A of 0, we are assuming is A000. The base address of B of 0 is E000.
We know that A of 1 is going to have an address, which is 8 bytes after the address of A of
0; similarly for B of 1, and so on. So, we can fill in the second column just as we did
for vector sum reduction. In the next column, once again using the same principle for the
interpretation of the bits, we calculate that the cache index associated with A of 0 is
256. If you do the same calculation for B of 0, you will notice that the index is exactly
the same, 256; and, as we had seen, A of 1, A of 2 and A of 3 are going to have the same
index as A of 0, since they come from the same cache block. So, this is the situation
that we end up with, under the assumption that the base address of A is A000 and the base
address of B is E000.
So, what we have now is the situation where, if we assume that the program starts with a
cold cache, then, as before, we argue that the access to load A of 0 is going to be a
miss because it is a cold start. What about the next reference, which is to B of 0? B of
0 is also going to be a miss: we know that A of 0 and B of 0 have the same cache index,
which means that in this direct mapped cache only one of them can be in the cache at a
time. Since my first reference would have brought A of 0 and its block into the cache,
we can be sure that the second reference, to B of 0, will be a miss. We could think of
this as a cold start miss, in the sense that we have not referenced B of 0 before; if,
on the other hand, the program had been running some instructions prior to this, B of 0
might already have been in the cache. But here it is a miss because the cache starts off
empty. Now, before we move on to A of 1, let me just back up a little bit. The reference
to B of 0 is a miss, and we are arguing that we could describe it as a cold start miss.
But as a consequence of B of 0 being a miss, the block containing B of 0 is going to be
fetched from main memory into the cache. Therefore, the block containing A of 0 would have been
removed from the cache, and the block containing B of 0 occupies that particular direct mapped
cache block. Therefore, when the next reference, to A of 1, happens, it will turn out to
be a miss, because the block containing B of 0 currently occupies the particular cache
block which would have held the block containing A of 1. We refer to this as a conflict
miss. A conflict miss is a situation where, because
of the accesses made by my program, a miss has resulted. If my program had not referenced
B of 0, between the reference to A of 0 and A of 1, then this would not have been a miss.
It would, in fact, have been a hit, because of spatial locality of reference, as we saw
in the case of the vector sum reduction loop. Therefore, the miss that we encounter on A
of 1 is a different kind of miss from the cold start situation. It is a miss that arises
because of the way the program has been written, and it is therefore called a conflict
miss: a miss due to conflicts in the cache block requirements of the memory accesses of
the same program. As we run through the sequence of events, unfortunately we find that
all of the references made by this program will be misses.
Among the references that we are concerned about, the references to the arrays A and B,
the vectors A and B, there is not a single access made by this program that is going to
be a hit with this particular cache. In short, we are going
to see a hit ratio of 0 percent. This program is going to suffer from a hit ratio of 0 percent.
Every single access to array A or array B is going to have to be satisfied out of memory.
It will involve fetching the block from memory into the cache and reading the array element
out of the cache. But that is going to happen for every single access made by this program.
The program suffers a hit ratio of 0 percent or miss ratio of 100 percent. What is the
source of the problem over here? The source of the problem here is exactly the same problem
that we had identified with direct mapped caches. This is a situation where I am accessing
A of 0 and then B of 0, and B of 0 conflicts with A of 0: it wants to occupy the same
cache block, because A of 0 and B of 0 have the same index. We have two different entities
with the same cache index, both wanting to occupy the same cache block under the direct
mapping scheme. So, the problem that we had identified with direct mapping is hitting us
here. The elements of the array A and the array B, even though they are accessed in order,
are being accessed with the same cache indices; they conflict with each other, and our
direct mapped cache is suffering the worst possible situation.
Now, one simple observation we can make here is that the hit ratio would have been better
if the base address of B had been selected such that it differed enough from the base
address of A for the two to have different index bits. Note that we had assumed the base
address of A to be A000 and the base address of B to be E000, and that is where this
conflict came from: both A of 0 and B of 0 ended up with the cache index of 256.
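A small check, using the assumed base addresses and the same shift-and-mask interpretation as before:

    #include <stdio.h>

    int main(void)
    {
        unsigned a0 = 0xA000, b0 = 0xE000;     /* assumed base addresses */
        printf("index(A[0]) = %u\n", (a0 >> 5) & 0x1FF);  /* 256 */
        printf("index(B[0]) = %u\n", (b0 >> 5) & 0x1FF);  /* 256: same block */
        return 0;
    }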
Both of these indices happen because of the assumptions about the base addresses. Remember,
the base address decides the cache index, because it is bits from the address that the
cache hardware uses to decide where to look in the cache under the direct mapped scheme.
Therefore, this gives us a basis for trying to improve the behavior of our program. If we
can somehow
come up with a mechanism by which we can cause the base address of B, in other words the
address of B of 0, to be different enough from the base address of A that they do not have
the same index bits, then we can improve the hit ratio of this program. The question
becomes: how can we do that? I am going to describe one idea, which I will refer to as
packing, by which a programmer can get some kind of control over the base address of
something like B of 0. The basic assumption we are going to make in coming up with this
scheme is that it is the compiler that assigns virtual addresses to the different variables
in our program, like the vector A, the vector B and so on, and that the compiler assigns
these virtual addresses to the variables as and when it encounters their declarations.
Therefore, it assigns the address hex A000 when it comes across double A[2048]; next it
encounters double B[2048], and therefore it assigns it the address hex E000 (A000 plus
2048 times 8 bytes).
Now, if I work with this information and I want the base address of B to be different, I
could just cause the base address of B to be shifted, maybe by doing something artificial
like the declaration sketched below. What have I done? Instead of declaring A to be of
size 2048, I have declared A to be of size 2052; B, once again, is of size 2048. What is
the impact of declaring A to be of size 2052?
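A sketch of the padded declarations; whether the compiler actually places B immediately after A is an assumption, as discussed, and the 4 extra elements are never used by the program.

    /* 4 extra doubles = 32 bytes = exactly one cache block of padding. */
    double A[2052];   /* only A[0] .. A[2047] are actually used */
    double B[2048];   /* hoped to land at 0xE020, cache index 257 */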
The impact is going to be this. Under the old scheme of things, the compiler was going to
assign the base address of A as hex A000 and the base address of B as hex E000. With the
new declaration, I have made the vector A bigger by one cache block, and therefore, if the
compiler just goes ahead and uses the next available address for the base address of B,
then the base address of B becomes hex E020, and its index will be one more than the index
would have been with E000. In other words, if the index associated with hex E000 is 256,
then by shifting the base address of B by one cache block (by artificially making A one
cache block, that is, 4 array elements, bigger), I will be shifting the index of B to 257
rather than the 256 that it was before. Therefore, by artificially increasing the size of
A, I have caused the compiler to give an index for B which is not going to conflict with
the index of A. The base address of B will now be hex E020; in other words,
it is going to have an index of 257. If you do the calculation of the index, take the hex
address E020 and break it up into offset, index, and tag; you will find that, instead of
having the value 256, it has the value 257 in its index bits, the 9 bits above the least
significant 5.
So, that is apparently going to be the impact of changing the declaration of A in this
small way. Now, I am not saying that the vector A is really being used as a larger vector;
all that we are trying to do is to artificially cause the compiler to give us a more
favorable address for the vector B. I could have achieved the same effect using a slightly
different mechanism. For example, I could have kept the size of the vector A at 2048, but
between the declarations of A and B I could have included, let us say, 4 other double
variables: x, y, z and w. The effect would have been that each of them would have been
assigned an address after the address of A; so x would have the address E000, y would
have the address E008, and so on. The net effect would have been that B would have
benefited in the same way, ending up with an index value different from that of A.
So, the net effect is going to be that, if I run the same program with just this one change,
causing the base address of vector B to be shifted by one cache block, then the performance
of the program is going to be different, and you can verify this by filling out the table.
You will find that I get a miss on A of 0 and a miss on B of 0, but then a hit on A of 1,
a hit on B of 1, a hit on A of 2, a hit on B of 2, and so on, because of spatial locality
of reference, and once again I will be back at a 75 percent hit ratio, which is what I got
in the case of the vector sum reduction. Therefore, this very small change turns the
program from one which had a 0 percent hit ratio into one which has a 75 percent hit
ratio, and this is something under programmer control. So, with this example, we have
seen two examples of how some understanding
of caches and some understanding of how the compiler works can be used to understand the
way that our programs behave in connection with the cache. In the lectures that follow,
we will look at a few more examples. I will stop here today, just reminding you that in
this lecture we have seen two examples of analysis of the behavior of simple programs.
The programs have been very simple: just simple loops doing operations on vectors. This
allowed us to do a somewhat thorough analysis of the impact of the cache on the behavior
of each program, and in the second program we found that, by understanding what was
happening, we could quite easily find mechanisms through which we could substantially
improve the behavior of the program, from a 0 percent hit ratio to a 75 percent hit ratio.
We will proceed to look at four or five more examples in the next lecture.
Thank you