Mod - 17 lec - 34 energy - Aware software system - Part 3

Welcome to the lecture part 3 of energy aware software systems. So, today we are going to look at Microarchitectural Techniques to save energy and move on to how compliers are going to be useful in saving the energy using the Microarchitectural Techniques. For example, the CPU and memory are the energy guzzlers in a computer system. So, on the CPU front, we can use voltage and frequency scaling, change the voltage and frequency of the CPU. We can use supply voltage getting of function units inside the CPU. We are going to look at each of these a little more in detail. Then, we can gate the clock of function units and we can do some bus encoding for the buses inside the CPU. On the memory front, a new type of memory called the drowsy cache can be utilized. It has a facility to switch off power supply to several cache lines which are not active. Compression in the instruction cache is possible to save energy because if we compress instructions, then the space occupied will be much lesser. So, the access will be faster and less energy consuming. Cache region reservation and partitioning, scratchpad memory - these are some of the techniques which are possible at the Microarchitectural Level. What is CPU voltage scaling? It basically reduces the voltage of the CPU and this saves mostly dynamic energy of the CPU. It cannot change the static energy consumption of the CPU. In CMOS circuits, the delay inside the CPU will increase with reduction in voltage. So, we definitely have will be forced to reduce the clock frequency, if we reduce the voltage. This in turn reads to performance degradation; so, the challenges in voltage scaling would be to retain the same level of performance, but with reduced energy consumption. For example, inside the Intel XScale 80200 processor, voltage can be change from 1.0 to 1.5 volts in small increments. Automatically, the frequency will also change from between 200 to 733 Megahertz. In other words, the highest frequency may be 733 and the lowest frequency would be 200 in steps of 33 or 66 Megahertz. So, if we increase the voltage, frequency increases; if we reduce the voltage, frequency also reduces. Another important factor is the time for change in voltage. There is hardware inside the CPU in the voltage regulator, which has to stabilize its output when we want the voltage to be change from 1 step to another step; this requires a finite amount of time. This is quite large, for example, up to 1 millisecond. The reason is there are capacitances inside the voltage regulator; these require time to charge and discharge. That is the reason why this change in voltage requires some extra time. We will look at static voltage scaling which implies we change the voltage of the CPU right in the beginning of running the program and we do not change it during the running of the program itself. This Intel's speed step technology which detects if the system is plugged into a power outlet or is a battery, and accordingly runs the processor at highest voltage frequency or less power hungry mode, thereby using less voltage frequency it actually saves battery power. Then, there is dynamic voltage scaling. So, this is possible at task level and is done by the operating system task scheduler. So, of course, it is also possible to do it inside the program; we will see that a little later, but that requires compiler help. Modern day, embedded processors provide for dynamic voltage scaling through program instructions. For example, Intel XScale, StrongARM, AMD mobile K6 Plus, Transmeta Crusoe and Power PC 405 LP - these are some of the processors which have instructions to change voltage. So, when the program is running, it is possible to change the voltage of the CPU. It is also possible for the operating system to execute these instructions and change the voltage of the CPU. Then I mentioned CPU Clock gating. Basically, if you gating implies stopping, so the clock to the function units is stopped implies that the function unit cannot run at all. These are all synchronous systems; they require a clock in order to run. Make the function unit clock 0 during the idle period. So, if the clock is 0 during the idle period, then the clock energy would be reduced. It also reduces the dynamic energy of the function unit, because the function unit cannot run when the clock is zero. It reduces dynamic energy usage, but the static leakage remains. Not much can be done about it unless we cut the voltage of the function unit itself. The change in the energy due to clock gating is not really great, but it definitely helps; we need to save energy in every possible way so that the total sum of energy saved due to several techniques is reasonable. Both circuit-level and compiler level techniques are possible in this case for gating the CPU Clock; we will see that a little later. So, supply voltage gating is another possibility for function units; what we saw before was supply voltage gating for the whole CPU. Now, are rather supply voltages scaling for the whole CPU? We are looking at supply voltage gating of function units. So, function units are switched off during idle periods. So, this implies that even the static energy is going to be saved and circuit level automatic techniques have been proposed for this purpose. So, they have a mechanism to sense idle periods. They see how many cycles the function unit has been idle and based on that the algorithm switches off the supply voltage to the function unit. Let us say two cycles of inactivity, they trigger switching off of the supply voltage to the function unit and then it remains in that state until the function unit is required again. So, once it is required, the supply voltage is reactivated and the function it becomes functional. But both switching off and switching on requires some finite amount of time. Now, that has to be factored in when we want to actually use this technique in the architecture or through the compiler mechanism. Now, let us look at compiler techniques to save dynamic energy in CPU. The first one is not strictly a complier technique, but has been mentioned because it is widely referred to in the literature. We accumulate the work load history, what is the load on the CPU, for how many and if during this particular interval the CPU has not been working at its best. For example, its activity factor is probably less, may be 0.5, 0.4, and 0.6. In such a case, it is possible that by running the CPU at a lower voltage and frequency during the next interval. We are going to save some energy, but at the same time, stretch the activity factor to 1 and make sure that the CPU itself, rather the program itself does not suffer in performance. If this requires direct intervention of the operating system and or application, it in interval based techniques. It considers, as I said, the idle time on a previous interval as a measure of processor's utilization for the next interval. So, the assumption is whatever happened in this interval will happen in the next interval also. Use this to decide on the voltage frequency settings to be used throughout the next interval. It is also possible to use a moving average of the previous intervals and this came possibly give some better advantage. Inside the compiler, how do we incorporate a dynamic voltage scaling technique? The basic idea is to make different parts of a program, operate at different voltage frequency pairs. During CPU stall, that is when memory operations are happening and the result of the memory operation is required for the next instruction to execute and therefore the CPU is doing nothing. During such a stall, it is possible to scale down the CPU voltage and frequency, save energy, but performance will not be degraded because CPU is really doing nothing. So, we can reduce its voltage and frequency. Let me show you a picture to make this very clear. The memory is supposed to operate at a constant voltage and frequency. So, let us say, the memory operation is going this way and it requires T mem cycles to finish. The CPU operation is happening. So, its starts here; it goes on until this point. Then, until it gets the memory operand, it can do nothing. So, it stalls and waits until the memory operation is completed, receives the memory operand and then continues its operation. Now, there is nothing wrong, if we stretch this period - which is actually the useful activity region of the CPU - up to this point by reducing the voltage and frequency of the CPU; so, because during this period, the CPU is doing nothing. We can stretch this period up to this period, make the CPU run at a lower speed and thereby we actually during this entire period anyway CPU was active. Now, we have reduced the voltage; the time remains the same; so, we have saved energy. At this point, we increase the CPU speed, the frequency and voltage. So, that the performance is not degraded. This is the basic idea behind the voltage scaling. Memory operations are assumed to be asynchronous. So, they go on independently of the CPU and they always operate at the highest voltage frequency. It is the CPU whose voltage and frequency will be change. Schemes of this kind have been shown to provide 0 to 25 percent energy saving with 0 to 3 percent performance loss. It is possible that the whole program has to be run at a single frequency - the highest possibly and it may not be possible to divide every program into couple of regions which run at different frequencies. That is the reason why, the savings vary between 0 to 25 percent. Now, how does this perform? So, we try to partition a program into regions based on the consumption of energy at different voltage and frequencies. It has been shown, that two regions are usually sufficient: one of the region operates at a lower frequency and the other operates at the maximum frequency. Introduce the frequency-changing instructions at the entry and exit of the lower frequency region. In other words, when the program control enters the lower frequency region, the frequency is changed to a lower one and when it exits the lower frequency region, the frequency is changed again to the higher one. Finding best partitions in this fashion is an optimization problem. We need to factor in the frequency change timing: 100 memory accesses may be needed 10 to 20 microseconds depending on the processors frequency. In order to perform this change, I already mentioned that, this is due to the voltage regulator and its capacitances. What is a region? All top level statements inside a region R are executed the same number of times. This is the rationale behind dividing a program into regions. So for example, a loop nest is a region, call site, a procedure sequence of statements or entire program; they are all regions. So, this is an informal description of a region, but how do we use this. So, what we really do is, let us assume, that we have built or rather partitioned the program into 2 regions. Now, given that there are 2 regions: there is a region R and there is a whole program P. There is a constraint on the size of the region R. In other words, T (R, f max) is the time required to run the region R at the highest frequency f max and T (P, f max) is the time required to run the whole program at f max. So, T (R, f max) divided by T (P, f max), the ratio of these is required to be greater than or equal to rho, which is usually about 20 percent. Why do we want this constraint? The point is, if we have a very small region, then the number instructions executed in that region will be very small. So, the time that the CPU spends inside the region R will be very small. If the region is very small, then the overheads of the switching; that is we have to switch to a lower frequency and then again to a higher frequency at the borders of the region R. This overhead may become quite substantial. In order to make sure that this overhead is not very high, we make the region of a reasonable size. If the region is very large, then the energy benefits are going to be very low. We have to make the region of reasonable size somewhere in between. So, in the optimization problem, there is a constraint on the region which is set by experimentation for each processor is could be different. So, we set it to approximately 20 to 25 percent. What is the minimization problem? So, it says minimize over all regions R and all frequencies f. We have to find the region R, let us see later how to find it. For a given R, we need to experiment with all possible frequencies f. What is it, which we need to minimize? The energy: total energy of the CPU. The energy of the CPU or the program is in 3 parts: the first part is powers spent by the region R, at frequency f multiplied by the time spent by region R, at frequency f. So, power into time is energy. This is as far as region R which is executed at lower frequency f. Then, we have the region P minus R, which actually operates at the highest frequency. So, the power consumption, at this level is P f max and T of P minus R comma f max, is the time spent in the rest of the region in the program; that is P minus R. The last part is the energy spent for switching. So, P of trans into 2 into N of R. N of R is the number of times the region R is executed. There is a factor 2 here, because we need to switch twice. We need to switch to the lower voltage, when we enter the region and we need to switch to the higher voltage, when we exit the region. P trans is the power consumed during switching. This is the total energy consumption for the program. This must be minimized over all possible regions and frequencies subject to a constraint. The constraint is the total time required to execute the program should be below a certain threshold. So, r is the performance degradation tolerated, let us say, 5 percent. T (R, f) is the time required to execute region R, at frequency f. T P minus R comma f max is the time required to execute the rest of the program - P minus R, at the frequency f max. Then, we have the switching time. T trans into 2 into N R, the time required for switches. This entire thing is the time for the program with a lower voltage and frequency selected for a region. This time should be less than or equal to 1 plus R into T of P comma f max. So, assume that we did not do any reduction and voltage for any region in the program. Then, the time required would be T of P comma f max at the highest frequency. So, we are willing to spend a little more time. So, that is why the 1 plus R factor. So, 1 plus R into T of P comma f max, is a little more than T of P comma f max and we do not want the degradation to be more than this particular specification of r, say, 4 to 5 percent. How does one implement dynamic voltage scaling? So, there is going to be code instrumentation for very basic regions, such as: call sites, loops, if then else and so on. The instrumentation measures, the time R comma f. So, assuming that the region is small, we require a very high precision timer. So, T(R, f) is the time required to execute region R, at frequency f and N R is the number of times region R is executed. This can be measured using a counter. Code instrumentation is needed to start and stop the timer and also increment the counter at the various basic region levels. P of f max is the maximum power consumption - this is steady power consumption; let us, assume that. So, if once we know that the power consumption is steady, we can either measure it directly using a wattmeter connected to the power supply or we could use a simulator such as, Wattch to tell us what is the power consumption of the processor? Power consumed at various frequencies to simplify the matter is obtained through interpolation using P of f max. So, change in frequency is assumed to be proportional to change in power. So, we are going to use that equation in order to interpolate and find the power consumption at various frequencies. If this is not going to be linear, then it is possible to change the frequency, run the program at that particular frequency and measure the power. We could do this for a couple of frequencies and then fit a curve and calculate the power consumption at other frequencies using this fit at curve. We have basic regions, but we do not want such very small regions - regions have to be of reasonable size. So, we must provide a simple composition mechanism for combining basic regions. So for example, we could combine different segments of the program into one using program composition rules. So, the then part and the else part could be combine into 1; 2 loops could be combine into one and so on. These are the usual programs syntax composition rules that we are going to use for producing bigger and bigger units. So, in the same way as a parser, starts from the smallest expression and builds up parse tree for the whole program. We are also going to use the same rules; combine smaller basic regions into bigger regions. Then compute the power and time requirements of the composed regions. All the region combinations are enumerated and the optimal one is chosen. This is not going to be too large. So, we can afford to do this; otherwise an integer linear program or some other formulation can be utilized as well. So, that was about changing voltage during run of the program, rather using run of the program, but determined by the compiler. So, what happens if we have a dynamic compiler, in other words, for example, a just in time compiler for java, is also a form of dynamic compiler. So, what happens in such a compiler, there are java byte code instructions, which are actually translated into machine code during the interpretation of the java byte code. Therefore, there is no separate time available for us to analyze the program like we proposed in the previous scheme. Then, insert instructions to change frequency voltage etcetera. It is also possibly superior to change the frequency and voltage during by accumulating information during the run of the program, rather than by the compiler. So, just in time compilers and dynamic compilers try to do this. They try to accumulate information about, which methods are very hot or much executed very frequently using simple profiling methods. Then, they compute reduction in frequency as proportional to the relative CPU slack time. The time required for accessing memory minus the overlap time, which is a useful activity of the CPU - that is, the overlap. So, T mem minus T overlap is proportional to the CPU slack. So, divided by T total is really the ratio, which is used as a measure of the slack of the CPU. This relative CPU slack time is used to compute the reduction in frequency. Let us see, how T mem and T overlap are computed, we cannot be using very heavy instrumentation to compute this. We are going to use the performance counters available inside the CPU, which actually have no overheads at all to compute the CPU slack time. For example, T mem by T total can be computed as k 1 into number of memory bus transactions divided by number of micro operations retired. So, both these number of memory bus transactions and number of micro operations retired are given by two different performance counters. So, what is involved is to reading those performance counters. k 1 is a constant which is set by some experimentation. Similarly, T overlap by T total is proportional to k 2 into floating point and integer instructions divided by number of micro operations retired. So, again there is a performance counter which gives us this information and also this information. We could compute a quantity which is proportional to T overlap by T total. So, computing the slack or relative CPU slack is a fairly straight forward process, once we compute these 2 quantities. So, remember these are obtaining from the performance counters so, this is a fairly straight forward operation. So, that was about using dynamic voltage scaling inside a dynamic compiler. Let us see, how dynamic voltage loops scheduling is done. Why should we look at a different technique for loops? The reason is twofold: one is, suppose, we have program which are not necessarily memory bound. So, the previous technique assumed that the programs are memory bound, that is, the CPU is waiting for a memory operation and therefore, it has very little to do; during this time we reduce the frequency and voltage of the CPU; suppose that was not so. The program was not necessarily memory bound. The other reason is, we could have multicore processors and we still have not seen a technique which can be used for multicore. This particular technique is can be used both on multicore and single core processor. The strategy is to repeatedly regroup a loop based on what is known as a rotation scheduling. I will give an example to show, what rotation scheduling is very soon. Decrease the energy by dynamic voltage scaling as much as possible within a timing constraint and this is not necessarily for a memory bound program. So, what exactly is rotation of a loop? Let's assume that we have a loop of this kind. There is some initialization here and then, we have a for loop with 5 instructions. Remember, that we have the order of these instructions A B C D E. So, loop rotation implies that we change the order of the instructions of the loop subject to the dependents requirements of the various instructions. So for example, this instruction one of them was taken upwards; so it is here. The loop started with the second instruction onwards and the A i instruction was introduced at the bottom. So, when the loop starts execution, this is executed for the first iteration and then, these are executed and this will be the instruction for the second iteration. We have changed the instruction subscripts also. It was A i equal to E i minus 3 star E i minus 3 whereas, now it is A i plus 1 equal to E i minus to the star E i minus 2; others have not changed. So, there is a prolog which of some size and then there is an epilog for the rest of the instructions in the last iteration. Because we have already executed one instruction in the last iteration, the rest of them have to be executed outside the loop. So, this is rotation of the loop by 1, because this instruction has come to the bottom like this. Suppose, we rotate the loop twice, then this instruction will also go up. We start with this instruction and B would actually be placed after A and there would be appropriate epilog here. So, this is the rotation of a loop. How does loop rotation help? Let us consider a 2 core processor: there are 2 processors; 2 CPU's and the hatched area indicate that the processor is running at the highest frequency in voltage. The black area indicates that the processor is doing nothing. So, to begin with for some program, not necessarily the program that I showed you just now; processor 1 is executing the 3 regions: A D and E, and then the processor number 2 is doing nothing for certain duration of time. Then, it is executing C and B; then again it is doing nothing. The number of time slots is really 12; this cannot be changed, this is a time requirement -time constraint. Suppose, we rotate the loop once, the question is, is it possible to actually do some dynamic voltage scaling and get some benefit out of it. Now, the processor 1 executes D and E. A is actually scheduled on processor 2. So, processor 2 executes C and B, and then, it reduces the voltage. That is why, there is a linear bar here, executes A in lean mode - that is, lower frequency mode and then, again changes the voltage back to its normal. Now, during this time, the program is really doing nothing on processor 1. During this time, there is some saving because we have actually reduced the voltage of the processor number 2. Let us do, loops rotation once more. Now, processor 1 does very little here; then it executes E; then it executes D and in between it does nothing. Processor number 2 actually runs the entire duration of 12 cycles. It executes in lean mode, at a lower frequency so, there is a substantial amount of reduction in energy consumption. If we do, let us say loop rotation once more. In this particular example, we are able to reduce run D on processor 1, then reduce the voltage and run E on processor 1. Then, the processor number 2 runs the other C B and A in lean mode. So, there is saving here and also here. So, in this fashion, we are able to reduce a substantial amount of energy. This only an example, which shows that the third rotation blimps out the maximum benefit; it is not necessarily show in all programs. It is possible that, only one rotation is allowed in certain loop; 2 may be possible in another loop; 3 may be possible in one more. The number of rotations is dependent on the dependencies in the program and is not necessarily always fixed. But this example shows, that loop rotation enables slowing down processors without reducing the amount of time. So the time is the same; power consumption is less, so, energy is saved. So, that is loop rotation and scheduling of the parts of the program, rather parts of the loop in order to save energy. Again, we must stress that we are saving dynamic energy and we are not really saving any static energy in this case. Let us look at a third type of processor. We saw a single core, multi core and now, multiple clock domain processors. What is an MCD processor? The basic idea is different parts of a CPU will run at different clock frequencies. Why is this needed and how does it solve any problems at all if a t. In general, this type of MCD processor addresses difficult clock distribution and power dissipation problems. So, it is well know that in a large CPU or large chip, in general, supplying the same clock to all parts of the CPU is very difficult; it is not easy. It is simpler if separate clocks are used in different parts of the CPU, but then clock synchronization between these is a very difficult problem. So, it is not easy to use such processors. Now, why should we actually have large chips? Obviously, the larger the chip, more chip area and more electronics that can be put on it. So, power dissipation becomes very large if we have a very large chip. We are going to have smaller regions of the chip, which have different clocks. So, the clock distribution electronics and the CPU area needed for clock distribution itself, increases power dissipation. So, if we use local clock in different parts of the CPU; this clock power dissipation also comes down. So, we are going to partition the chip into domains. So, each partition functions at an independent voltage and frequency, so that is the assumption in an MCD processor. It needs synchronization circuits between for inter-domain communication such as buffers, queues and so on. Domains, whose performance is non-critical, can operate at lower frequencies and therefore, they will consume potentially less power. Performance costs for synchronization are not very significant for out-of-order processors. So, if somebody builds MCD processors; it may still be useful. Here is an example, so there is an MCD Microarchitecture. So, there is a front end which does fetch and decode; so, this is one domain; memory itself main memory is another domain. The L1 data cache, load queue and L2 cache form one more load domain, then, there is a floating point domain and integer and register domain. So, this is a 5 domain processor. How does one use the compiler technique for doing the DVS on such processors? Basically, the technique builds a Timed Petrinet based program performance model; this Petrinet model is parameterized by microarchitectural settings and resource configurations. I cannot provide a tutorial on Petrinet for lack of time, but it surprises that a Petrinet can model concurrency; it can model nondetermine some and so on. This model is used to evaluate the performance impact of various frequency settings for a program region. During this evaluation, we really do not have to run the whole program, but it is necessary to just run the program. It rather run the simulation of the Petrinet and watches the performance of that processor using the Petrinet model for certain duration. So, using this duration based simulation; it is possible to evaluate the performance of the CPU at various frequencies. So, we choose a low frequency setting with acceptable impact on performance this impact can be specified by the user. The advantage of this particular scheme is that, it uniquely and directly estimates performance impact of a lower frequency setting instead of relying on weak indicators of performance. So, directly we try to simulate for certain duration and measure the performance. The results available for MCD processors using this Petrinet model based technique. So, for the SPEC FP benchmarks: there are many L2 misses in this bench mark. So, energy delay product improvement is very large. Our technique uses saves 60 percent energy, while meeting performance constraints; whereas, existing hardware profile-based dynamic voltage scaling techniques save an only about 34 percent. Media Benchmarks have almost no L2 misses; so here using ED has a metric does not help much; so, we use the ED square a measurement metric and improvement in our case, is 22 percent whereas, in the case of profile based hardware technique is about 18 percent. In general, hardware based DVS saves less energy; relatively and it is better in media benchmarks and whereas, the queue occupancies of floating point and the load store domain are very low in media benchmark. That is why; it gives slightly the profile based hardware technique gives much better results. So, our technique is found to be very good if there are many misses L2 misses such as in the SPEC FP bench mark. Now, let us start looking at Compiler Techniques to save Static Energy in the CPU. So far, we have seen techniques which saved dynamic energy. So, we did not see any techniques to save any static energy. Function unit voltage gating is one of the forms, which can save this static energy; that is supply voltage to the CPU for the functionality switched off. Why should we do this? I mentioned this in the introduction: Leakage energy is the static dissipation energy in CPU cache etcetera and function units are in active state, but are not doing any useful work. That is the time when we actually want to save energy. With the 70 nanometer technology, leakage energy consumption will be on power with dynamic energy consumption, whereas, above 70, say, 90 around 35, older technologies dynamic energy consumption was the maximum; rather it was actually the major contributor to energy consumption whereas, static energy consumption was very low. With lower level technologies say 65 and then 32 etcetera. The leakage energy is going to be high. We need techniques to save leakage energy, because whether we do anything useful in the CPU or not; it will still consume as much energy as when the CPU is really doing something useful. Dynamic energy consumption and static energy consumption will be on power. So, each will contribute 50 percent of the energy consumption. So, we need to save both of them in order to get the maximum benefit. So, there is something called dual-threshold domino logic with sleep mode. This is an electronic circuit and this can facilitate very fast transitions between active and sleep modes. In other words, when a function unit is in active state; the full voltage is supply to it and when it is in sleep state; the voltage is made 0. So, this is the transition that we are talking about between active and sleep state. But the nice thing about these dual-threshold domino logic circuit is; they can actually switch between these active and sleep states with very little performance penalty; very moderate energy penalty. So, observe the words, there is not much performance penalty, because we are going to switch within 1 cycle. So, active to low is 1 cycle; low to active is another cycle, but the once the switching time is very low, energy consumption becomes high. There is moderate energy penalty for such switching. Now, the problem is if the switch requires a fair amount of energy, then is it useful to put the CPU in sleep mode very frequently; obviously not. So, the 1 cycle of idleness is all that we required putting it into low leakage mode, and we also know the factor that motivates us is, integer ALU's are known to be idle for 60 percent of the time on the average. That is, the integer ALU's to very little for 60 percent of the time in general. There are two schemes: one is a pure hardware scheme proposed by Dropsho. So, they have these dual threshold domino logic circuits. They watch the CPU rather the ALU for 1 cycle and if the ALU is idle for 1 cycle or more; it is put into sleep mode. Automatically, the hardware takes care of switching from low to high and high to low. The problem is, there is 26 percent energy overhead, over the ideal scheme with no overhead. The switching causes 26 percent overhead; assuming over the scheme which assumes that there is no overhead. There are very frequent transitions between active and sleep states. So, if software-based scheme such as a compiler based one aids the hardware; then together they can actually save much more energy with very little performance loss. What is it that we are trying to do here? This technique can be used both for voltage gating and also clock gating of the CPU. So, the CPU is active for some time. Now, we try to bunch instructions, which use the same CPU rather same function unit. So, that the active and idle periods function units are increased; let me elaborate. The function unit is executing some operations. Now, in the program which has not been transformed by any compiler technique, it is possible in the worst case that there are 1 or 2 ALU operations. Then, there is an idle period of 1 or 2 cycles; another ALU operation another idle period and so on. And because the idle period is more than 1 cycle; the hardware switches it to low state and then again to high state, whenever the FU's needed. Suppose, the ALU instructions which are executed after idle period - two consecutive idle periods are not dependent on each other. In such a case, we can actually push these ALU instructions together, rather the CPU instructions together so, that the function units which was busy in executing 1 or 2 instructions. Now, continues to execute the next bunch of instructions also, because there was no dependency. The instructions scheduler brought them together. So, the ALU now keeps itself busy as long as possible. That is, as long as there is no dependency on the next instruction and it is force to wait. So, assuming that there was an idle period between consecutive two instructions, now there will be no idle period between these two instructions. The other advantage is, because we have made the function unit busy for a longer time. The idle periods actually add up together so, idle periods increase and the busy periods also increase. That means the switching between the active period and idle period has reduced. So, if there were 4 or 5 switches between; now, it probably reduces to 1 or 2. So, there by, the number of switches reduces and the energy consumed during this switching will also reduce. So, the CPU uses supply voltage clock gating during this idle periods, that is, done by the hardware whereas, the rearrangement of instructions so, that the instructions which use the same function unit are bunch together is done by the compiler. So, this leads to better benefits and it saves transition energy in the CPU as well. So, we will stop the lecture at this point and next time we will look at details of the instructions scheduling mechanism which saves energy to a large extent. So, this is based on the same list scheduling strategy that we have discussed before, with the heuristics for priorities etcetera being different. Thank you.