Welcome to part 2 of the lecture on object-oriented languages and their implementation.
The last lecture covered the language requirements, how to map names to methods, visibility
of variable names, and code generation for methods. Today, after a very quick review of
the code generation, we will look at simple optimizations.
We saw this picture in the last lecture; it shows an example of a class hierarchy with
complete method tables. The idea is to make method invocation faster: each class stores
a complete method table. Whenever an object wants to invoke a method, the method table
is looked up, the offset of the method is applied, and then the method is called. This
works whether the method is inherited or implemented in the present class.
There are no problems in implementing single inheritance; it is very simple. But implementing
multiple inheritance - that is, a single class inheriting from two other classes, for example
class C inheriting from both class A and class B - creates some difficulties.
For example, say the object records of class A contain only A's data members, and those of
class B contain only B's data members; but class C, which inherits from both A and B, has
space for A's data members, B's data members and C's own data members. The difficulty is in
calling methods, and how to address this difficulty is what we studied in the last class.
Let us review it very quickly.
Now, consider calling methods on an object of class C. If the method belongs to class A,
there is no problem, because the data members of class A sit at the beginning of the object,
exactly where A's methods expect them. If the method belongs to class C itself, there is
also no problem, because C's methods are compiled with full knowledge of C's layout. The
problem starts when we invoke methods of class B on objects of class C: B's methods expect
B's data members at the start of the object, but in C's layout they come after A's data
members, at an offset.
We need to provide this offset information, and this can be done in two ways. The first is
to store the constant offset in the method table along with the methods; the generated code
adds this offset to the receiver's pointer before invoking the method. That is what I was
saying: when a method inherited from class B is invoked on an object c, the receiver pointer
should point to the B-part of the object; if it points to the beginning of the object, it
is wrong.
The other way is to use trampoline functions. These are small functions created for each
method of class B; they increment the receiver pointer by the required offset and then
invoke the actual method, and of course there is a corresponding decrement after the return.
There is not much to choose between the two methods. Trampolines may appear very expensive,
but they are not, because they can be inlined, the offsets are all constants, and so on.
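The receiver adjustment described above can be simulated with a minimal Python sketch. The layout sizes, the function names and the list-as-object representation are all illustrative assumptions, not part of any real runtime:

```python
# Hypothetical sketch of 'this'-pointer adjustment for multiple inheritance.
# An object of class C is laid out as [A-part | B-part | C-part]; a method
# compiled for class B expects its receiver to point at the B-part, so a
# trampoline adds a constant offset before calling the real method.

A_SIZE = 2   # assumed number of slots in the A-part
B_SIZE = 1   # assumed number of slots in the B-part

def b_method(this, obj):
    # A method compiled for class B: it reads B's first data member,
    # which it expects at offset 0 from its receiver pointer.
    return obj[this + 0]

def trampoline_for_b_method(this, obj):
    # Adjusts the receiver from the start of the C object to its B-part,
    # then calls the real method.
    return b_method(this + A_SIZE, obj)

c_object = [10, 20, 30, 40]                  # [a1, a2, b1, c1]
print(trampoline_for_b_method(0, c_object))  # reads b1 -> 30
```

Because the offset is a compile-time constant, a real compiler can inline the trampoline and fold the addition into the call sequence, which is why the two schemes cost about the same.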
We now look at simple optimizations which are required in implementing object-oriented
languages. The first one is that type inclusion tests must be made very fast. Let us see why.
Suppose class Y is a subclass of class X. We declare a variable a of class X and initialize
it with an object of type Y, as in X a = new Y(). There is nothing wrong with this: a
base-class pointer can point to any derived-type object. We can then also say Y b = a,
declaring another variable b of type Y and assigning the value of a to it. That is also
fine, because a currently holds a value of type Y due to the previous assignment. Both of
these assignments are perfectly valid, but the following assignment is not. Suppose we
declare a variable a of type X and assign it an object of type X, and then declare another
variable b of type Y and assign b = a. Now a holds an object of type X, and b is of type Y,
a subclass of X, so we cannot assign b = a. The other way round, when a was assigned new Y()
and was pointing to an object of type Y, the same assignment was perfectly ok; now it is not.
To either allow or reject such assignments, it is necessary to conduct runtime type checking
to verify that the assignment is correct. This cannot be done at compile time, because there
could be conditional code which cannot be completely evaluated at compile time.
Java also has an explicit instanceof test, which requires runtime type checking. If there is
conditional code involved, the type of the value held by a is known only at runtime; whether
b and a are compatible therefore has to be checked at runtime.
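The runtime check can be illustrated with a minimal Python stand-in for the Java example; the function checked_downcast and the class names are hypothetical, with isinstance playing the role of Java's instanceof:

```python
# Minimal sketch: whether a downcast-style assignment is legal depends on the
# object's runtime type, not on the static type of the variable it came from.
class X: pass
class Y(X): pass

def checked_downcast(obj, target):
    # Runtime type-inclusion test, analogous to Java's instanceof check.
    if isinstance(obj, target):
        return obj
    raise TypeError(f"{type(obj).__name__} is not a {target.__name__}")

a = Y()                      # X a = new Y();  -- legal, base pointer to derived
b = checked_downcast(a, Y)   # Y b = (Y) a;    -- passes, a holds a Y at runtime
a2 = X()                     # X a = new X();
# checked_downcast(a2, Y)    # would raise TypeError: X is not a Y
```

The whole point of the optimizations that follow is to make the test inside checked_downcast as cheap as possible.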
How do we do such tests? We must check whether an object of a particular class may legally
take part in such an assignment; that is, we need to check where the two classes stand in
the class hierarchy - whether a is a subclass of b, or b is a subclass of a, and so on.
There are many methods; we will look at one or two of them. We can store the entire class
hierarchy graph in memory. Here is an example of what a class hierarchy graph looks like,
one for single inheritance and one for multiple inheritance. A is the root, so there is
nothing beyond A; A does not derive from any other class. B and C derive from A, D derives
from B, E and F derive from C, and G derives from E. This is our class hierarchy graph for
single inheritance. For multiple inheritance there are differences: for example, E derives
from B and C, F derives from C and H, C derives from A and H, G derives from D and E, and
so on. Now, we want to check whether one node is an ancestor of another: A is an ancestor
of D, C is an ancestor of E, C is an ancestor of G, and so on, whereas B is not an ancestor
of F; these are all to be checked.
This is very simple to do. We anyway have a pointer from the object to its class record,
and the class record will have a pointer to the corresponding node in the class hierarchy
graph. We can traverse the class hierarchy graph up to the root and check whether the
ancestor we are looking for is met on the way. So, if we want to check whether A is an
ancestor of G, we start from G and go upwards, and of course we meet A; if we reach the
root without finding A, then A is not an ancestor of G. For example, we do not find B when
we traverse from G towards A. This is a fairly straightforward traversal to implement, but
only for single inheritance; with multiple inheritance there is too much checking, and it
is not very profitable.
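The single-inheritance traversal just described can be sketched in a few lines of Python; the parent table encodes the single-inheritance hierarchy from the example (A at the root, G below E below C):

```python
# Sketch of the traversal-based type-inclusion test for single inheritance:
# each class record points to its parent, and we simply walk up to the root.
parent = {"A": None, "B": "A", "C": "A", "D": "B",
          "E": "C", "F": "C", "G": "E"}

def is_ancestor(anc, cls):
    # Walk from cls towards the root, looking for anc on the way.
    while cls is not None:
        if cls == anc:
            return True
        cls = parent[cls]
    return False

print(is_ancestor("A", "G"))  # True:  G -> E -> C -> A
print(is_ancestor("B", "F"))  # False: F -> C -> A never meets B
```

The cost is proportional to the depth of the hierarchy, which is exactly the weakness discussed next.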
For example, suppose we want to check whether A is an ancestor of G in the multiple
inheritance graph. We need to check all paths from G towards the root, and we do not know
in advance which of them lead to A. We may check G, D, B, A - we would be fortunate to take
this path, because we hit A immediately and can terminate the search; but if we had taken
G, E, C, H, we would have wasted the search, and then we would have to check G, E, B, A, or
maybe G, E, C, A, and so on. So, with a large class hierarchy graph, checking inheritance
in a multiple inheritance environment is not straightforward; we may have to do a large
number of checks before we know whether the answer is true or false.
The execution time, both for single inheritance and multiple inheritance, increases with
the depth of the class hierarchy; this is clear. If the hierarchy is very deep, we may have
to go all the way to the root to determine whether some node is an ancestor of some other
node, and this ancestor relationship is what determines class derivation.
It is also possible to use a binary matrix. For example, consider a matrix with the class
types C1, C2, C3, C4, C5 along both the rows and the columns. BM is a binary matrix with
BM[Ci, Cj] = 1 if and only if Ci is a subclass of Cj. Look at the row for C2: C2 is a
subclass of C3, so that entry is 1, but C2 is not a subclass of C4, so that entry is 0;
C2 is also a subclass of C5, so that entry is 1 - so there is multiple inheritance here.
Single inheritance, multiple inheritance - all types of hierarchies can be represented
using the binary matrix. These tests are quite efficient, but the matrix will be large in
practice: if there are a large number of classes and the hierarchy is large, the matrix
will also be very large. The matrix can be compacted, but that increases the access time.
So the test is fast and efficient, but the matrix is large. Are there other solutions
possible?
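The binary-matrix test described above can be sketched as follows. The entries for C2 follow the transcript (subclass of C3 and C5, not of C4); the remaining entries, and the choice to make every class a subclass of itself, are illustrative assumptions:

```python
# Sketch of the binary-matrix subtype test: BM[i][j] == 1 iff class Ci is a
# subclass of Cj.  The matrix is assumed precomputed from the hierarchy.
classes = ["C1", "C2", "C3", "C4", "C5"]
idx = {c: k for k, c in enumerate(classes)}

BM = [
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],   # C2 is a subclass of C3 and of C5 (multiple inheritance)
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

def is_subclass(ci, cj):
    # One table lookup: O(1) regardless of hierarchy depth.
    return BM[idx[ci]][idx[cj]] == 1

print(is_subclass("C2", "C3"))  # True
print(is_subclass("C2", "C4"))  # False
```

The space cost is quadratic in the number of classes, which is exactly the drawback noted above.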
Fortunately yes; for single inheritance there is a very efficient solution. Consider a class
hierarchy which is a tree: A at the root, with B and C below it, and so on. Now we conduct
a post-order traversal of the tree: we first visit D, then B, then we cannot yet visit A,
so we go all the way down to G, then E, then F, then C, and finally A. In other words, D
gets number 1, B gets 2, G gets 3, E gets 4, F gets 5, C gets 6 and A gets 7. These
post-order traversal numbers are indicated as the second component of the number pairs
marked on each of the nodes; these are the labels.
The label of a node A is the pair (lA, rA), where rA is the ordinal number of node A in a
post-order traversal of the tree. Let the subtype-of relation be denoted by a "solid" less
than or equal sign; all the descendants of a node are subtypes of that node. This corresponds
to the class hierarchy graph, for single inheritance only. The subtype-of relation is
reflexive and transitive, but definitely not symmetric. It is reflexive because A is a
subtype of itself, and transitive because if, for example, B is a subtype of A and D is a
subtype of B, then D is a subtype of A. Now, what is the first component lA? It is the
minimum post-order traversal number among the descendants of that node, counting the node
itself. For the root A the minimum is 1, so A's left component is 1; for the node C the
minimum among its descendants is 3, so its left component becomes 3, and so on. That is how
the labels are computed: lA is the minimum of rP over all descendants P of A. Now, the
subtype-of relationship can be determined very easily using this condition: A is a subtype
of B if and only if lB <= rA <= rB. These are just numbers, so the condition can be checked
in constant time with only two comparisons. Let us verify this; take A and G, and ask
whether G is a subtype of A. Here lA = 1 and rG = 3, so lA <= rG holds; and rG = 3 while
rA = 7, so rG <= rA also holds. If we try the other way, whether A is a subtype of G, the
test automatically fails: in that case lG = 3 and rA = 7, so lG <= rA is satisfied, but
rA <= rG, that is 7 <= 3, is not. The inequality is violated, so the relationship does not
hold.
This test is very fast - it is O(1) - but it works only for single inheritance. There are
many extensions of this idea, using slices of the hierarchy trees and implementing them
using what are known as PQ-trees; these are far more complex and beyond the scope of this
particular lecture.
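The relative-numbering scheme can be sketched as a short Python program over the example tree; the dictionaries children and label, and the counter trick, are implementation choices of this sketch:

```python
# Sketch of relative (Schubert's) numbering: label every node A with (lA, rA),
# where rA is its post-order number and lA is the minimum post-order number in
# its subtree; then A is a subtype of B iff lB <= rA <= rB.
children = {"A": ["B", "C"], "B": ["D"], "C": ["E", "F"], "E": ["G"],
            "D": [], "F": [], "G": []}
label = {}

def number(node, counter=[1]):
    # Post-order numbers inside a subtree are contiguous, so the counter value
    # on entry is exactly the minimum r over the subtree.
    low = counter[0]
    for child in children[node]:
        number(child, counter)
    label[node] = (low, counter[0])
    counter[0] += 1

number("A")   # D=(1,1) B=(1,2) G=(3,3) E=(3,4) F=(5,5) C=(3,6) A=(1,7)

def is_subtype(a, b):
    lb, rb = label[b]
    ra = label[a][1]
    return lb <= ra <= rb       # O(1): two comparisons

print(is_subtype("G", "A"))  # True:  1 <= 3 <= 7
print(is_subtype("A", "G"))  # False: 7 <= 3 fails
```

The test works because a post-order traversal numbers every subtree as one contiguous interval, so subtree membership becomes an interval check.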
That was the first simple optimization, for single inheritance: Schubert's numbering, or
relative numbering, which we just studied. The second simple optimization is
de-virtualization. It is a very simple optimization conceptually, but not so easy to
implement efficiently: a simplistic implementation is easy, a very efficient one is a
difficult task. We are going to study only one method, using class hierarchy analysis.
What is de-virtualization? It reduces the overhead of virtual method invocation. How? We
try to statically determine which virtual method calls resolve to single methods. If a
virtual call can resolve to only one method, we do not have to go through the virtual
method table to call the method at all; we can replace it by a static call, naming the
class and the method directly. Such calls - whatever resolves to a single method - are
either inlined or replaced by static calls. I will show you an example of this, and we
will build a class hierarchy and a call graph along with it.
Take this simple example. Object is the Java superclass from which all classes derive. X is
a class deriving from Object, and Y and Z both derive from X. X has two methods, f1 and f2;
Y has a method f1 which overrides the f1 of class X, and Z also has its own f1 which
overrides the f1 of class X. Here are the declarations of class X and class Y; class Z also
contains main, so let us see what it does. main declares a variable a of type X and assigns
an object of type X to it; similarly, b is of type Y and is assigned an object of type Y,
and c is of type Z and gets a new object of type Z. These are the initial assignments. Now,
if some condition is true, then a is assigned c; this is perfectly valid, because a is a
base-class pointer and c is a derived-class pointer, so a = c is justified. Then there are
three calls: a.f1, b.f1 and b.f2. Let us see what they mean.
This is indicated in the form of the call graph: Z.main calls a.f1, b.f1 and b.f2. The call
b.f1 is always Y.f1 - there is no other possibility, and b does not point to any other
object either. The call b.f2 is always X.f2, because b is an object of type Y and Y does
not have an f2, so the call always resolves to the f2 of class X. But the situation is not
so simple with a.f1. If the condition were false, a would be pointing to an object of type
X, and a.f1 would invoke X.f1. If the condition were true, then a gets a pointer to c, and
a.f1 would mean Z.f1, the one local to class Z. So there are two branches of this kind:
this particular call does not resolve to a single method, it has two possibilities, whereas
the other two calls resolve to single methods. We can replace b.f1 and b.f2 by Y.f1 and
X.f2 respectively, and that takes care of de-virtualization here; but a.f1 cannot be
de-virtualized.
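The resolution logic behind this example can be sketched as follows. The tables defines and parent encode the example hierarchy; the functions resolve and devirtualize are hypothetical names for the two steps of the analysis:

```python
# Sketch of devirtualization by class hierarchy analysis: a virtual call site
# can be replaced by a static call only if every receiver type that can reach
# it resolves to the same method implementation.
defines = {"X": {"f1", "f2"}, "Y": {"f1"}, "Z": {"f1"}}
parent = {"X": "Object", "Y": "X", "Z": "X", "Object": None}

def resolve(cls, method):
    # Walk up the hierarchy to find the implementation a dynamic dispatch
    # on an object of exact type cls would invoke.
    while cls is not None:
        if method in defines.get(cls, set()):
            return f"{cls}.{method}"
        cls = parent[cls]
    return None

def devirtualize(receiver_types, method):
    # receiver_types: the set of exact types the receiver may have here.
    targets = {resolve(t, method) for t in receiver_types}
    return targets.pop() if len(targets) == 1 else None   # None: keep virtual

print(devirtualize({"Y"}, "f1"))       # Y.f1 -- b.f1 is always Y.f1
print(devirtualize({"Y"}, "f2"))       # X.f2 -- inherited, b.f2 is X.f2
print(devirtualize({"X", "Z"}, "f1"))  # None -- a.f1 may be X.f1 or Z.f1
```

The hard part in practice, which this sketch assumes away, is computing the set of receiver types that can actually reach each call site.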
There are many more complicated schemes available for de-virtualization, but they are
beyond the scope of this lecture. So we will conclude object-oriented language
implementation at this point; this is the end of this particular lecture, and we will
continue with the introduction to optimizations very soon.
Welcome to the lecture on Introduction to Machine-Independent Optimizations. So far we have
seen code generation, runtime organization, register allocation and the implementation of
object-oriented languages. Now we start looking at machine-independent optimization of
intermediate code: we will look at a large number of optimizations that can be performed
on intermediate code, and then see what information is needed in order to implement such
optimizations.
The outline of the lecture is very simple. We are going to see a large number of examples
to begin with, and then study the algorithms for such optimizations. We will see what code
optimization is and what the different types of code optimizations are, and then look at
illustrations of many of these code optimizations.
Many of you may recognize that this slide is from the introductory lecture that I gave. The
intermediate code generation process introduces much inefficiency. We use a very simple
strategy for intermediate code generation, so there are extra copies of variables, variables
used where constants would do, repeated evaluation of expressions - everything that you
would think of as inefficient happens during intermediate code generation. The machine
independent code optimization phase is supposed to remove such inefficiencies and improve
the code; we will see examples of how this can be done very soon. The improvement may be in
time, in space or in power consumption - any of these can be improved - but our emphasis in
this lecture will be on improving time.
Optimization sometimes changes the structure of programs, even beyond recognition. For
example, some of the functions may be inlined, so a function you had written does not even
exist anymore. It unrolls loops: you had written a loop from i = 1 to 100, but now it may
run from i = 1 to 25, unrolled 4 times. It eliminates some of the programmer-defined
variables; for example, induction variables such as loop counters may be eliminated, with
some other variable in the loop substituting for the loop counter. Code optimization is
essentially a heuristic approach: it consists of a bunch of heuristics, so the percentage
improvement depends on the program. A certain program you may not be able to improve at
all, and another you may be able to improve by 50 percent; there is no guarantee on how
much improvement can be made.
These are the various types of optimizations that we are going to see through illustrations.
This list is not exhaustive; there are many other optimizations, some of them minor and
some of which can be carried out only at machine code generation time. We will limit
ourselves to these optimizations and see how they work. The first is global common
sub-expression elimination, then copy propagation, constant propagation and constant
folding, loop invariant code motion, induction variable elimination and strength reduction,
partial redundancy elimination, loop unrolling, function inlining, tail recursion removal,
vectorization and concurrentization, loop interchange, and loop blocking. We have seen some
of these optimizations in different flavours before.
For example, we saw common sub-expression elimination, but that was within basic blocks. We
know how to tackle it: we build a directed acyclic graph using the value numbering
technique, and constant propagation and constant folding can also be done within basic
blocks using the same value numbering technique. So let us continue and see how these
optimizations work globally.
Code optimization needs information about the program. For example: which expressions are
being recomputed in a function? Which expressions are partially redundant (what partial
redundancy means will become clear very soon)? Which definitions reach a point? Which
copies and constants can be propagated? Such information is gathered through data-flow
analysis, and our discussion of data-flow analysis will begin after this introductory
lecture on optimization.
For a few examples, we will consider a standard bubble sort program, which everybody
understands. Here is a very simple version of it. The outer loop is for (i = 100; i > 1;
i--), and the inner loop is for (j = 0; j < i - 1; j++). The array used is of size 100 and
runs from 0 to 99, and we have not used any special early exit if the array is already
sorted - no flag and so on. If a[j] is greater than a[j+1], we do a swap operation;
otherwise we just increment j and go to the next element. At the end of one full j pass we
would have found the largest element, during the next i pass the next largest element, and
so on. So, that is really bubble sort.
Here is the intermediate code and the control flow graph for bubble sort. We start with
i = 100; then i > 1 is evaluated into t1, and instead of saying "if t1 goto B3, otherwise
exit", we have said "if not t1 goto B9", which leads to the exit; if the condition is
false, we fall through. Then the j loop starts with j = 0, and the upper bound of the loop
is i - 1; the code checks the upper bound, and if it is violated we jump out, otherwise we
continue with the rest of the program.
You can now see how much code gets generated even for a very simple check. Just to test
whether a[j] > a[j+1] we get comparisons such as t5 > t8, and then there is the swap
routine. For doing simple things like this we generate a large number of temporaries, and
then there are many opportunities for common sub-expression elimination. This will be clear
in the next example, after we
see what GCSE is. Global common sub-expression elimination is conceptually very simple.
Suppose we have y + z in a basic block, and along all the paths which reach this block
there is an evaluation of y + z, with no modification of either y or z along any of those
paths. In such a case it is very simple to see that we do not have to evaluate y + z all
over again: on each path we store it into the same temporary, say u = y + z, and use u at
the later point. That is global common sub-expression elimination. It is also possible
that, because of nested sub-expressions, a single application of GCSE does not give results
to the fullest extent possible; so let us see. We have a = x + y, then b = a * z, and on a
path from this there is c = x + y and d = c * z. Now x + y is a very straightforward common
sub-expression which can be eliminated: we introduce u = x + y, with a = u and c = u. Note
that we have used the same temporary u for both assignments. Now a = u is a copy operation,
and since a has not been modified, we can use u in place of a, so b = a * z becomes
b = u * z; similarly c = u is a copy operation, there is no modification of c in between,
so d = c * z becomes d = u * z. Because of this copy propagation we have found another
common sub-expression, u * z, in both places - so we can eliminate that also, introducing
another temporary v = u * z, with b = v and d = v.
In this way, we may actually have to perform several optimizations and then come back to
global common sub-expression elimination. The order in which the optimizations should be
carried out is not fixed; we perform them in a particular order which is heuristically
known to be good, again and again, until no more improvement is possible.
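The cascade in the earlier example (x + y eliminated first, then u * z exposed) can be sketched concretely. The quadruple representation and the function copy_propagate are assumptions of this sketch, and it handles only straight-line code, not full control flow:

```python
# Sketch: after one round of GCSE, x+y is computed once into u and copied into
# a and c.  Propagating the copies rewrites a*z and c*z into u*z, which a
# second round of GCSE can then eliminate.
# Quadruples are (dest, op, arg1, arg2); "copy" means dest = arg1.
code = [("u", "+", "x", "y"),
        ("a", "copy", "u", None),
        ("c", "copy", "u", None),
        ("b", "*", "a", "z"),
        ("d", "*", "c", "z")]

def copy_propagate(code):
    copies, out = {}, []
    for dest, op, a1, a2 in code:
        # Replace operands that are known copies by their sources.
        a1, a2 = copies.get(a1, a1), copies.get(a2, a2)
        if op == "copy":
            copies[dest] = a1
        out.append((dest, op, a1, a2))
    return out

after = copy_propagate(code)
print(after[3], after[4])
# ('b', '*', 'u', 'z') ('d', '*', 'u', 'z')  -- u*z is now common to both
```

A second GCSE pass over the result would introduce v = u * z, exactly as described above; this is why the passes are iterated.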
Coming back to GCSE on the running example, let us see where the opportunities are. There
is i - 1 in one place and i - 1 in another, so these are possibly common sub-expressions.
There is 4 * j in three places; 4 * t6 will become a common sub-expression only after a
round of copy propagation, but for the present let us consider 4 * j, i - 1 and j + 1,
the last of which also occurs more than once. Now, t2 = i - 1 is evaluated once, and this
evaluation reaches the later occurrence, so we can replace that i - 1 by simply t2, the
previous evaluation of i - 1. Likewise t4 = 4 * j, so t4 replaces the other occurrences of
4 * j, and t6 = j + 1, so t6 replaces the other occurrences of j + 1. This is very simple
common sub-expression elimination on the running example.
Even in this simple example, because of the inefficiencies of intermediate code generation,
we found three common sub-expressions, and if we apply the optimizations once more we can
very easily do some copy propagation. For example, t12 = t4 and t13 = a + t12, so t12 can
be replaced by t4; t14 = t6, so t14 can be replaced by t6, and the expression 4 * t14
becomes 4 * t6 again. There is another 4 * t6, and one more: t17 can also be replaced by
t6, so that too becomes 4 * t6, and these occurrences can all be replaced by t7. In this
manner, common sub-expression elimination can be done very effectively even on this small
example.
That is what is shown here after copy propagation: we have eliminated the copies that I
showed you - this became t6, this became t4, and so on. This is copy propagation on the
running example. We already did global common sub-expression elimination on i - 1, 4 * j
and j + 1; now 4 * t6 is a common sub-expression that can also be eliminated, but it was
exposed only after copy propagation. After the second round of common sub-expression
elimination the code becomes really efficient: t7 is computed once and replaces the
repeated 4 * t6. All the common sub-expressions are computed only once and used many times,
so now this is a compact program.
What about the next optimization? We have seen two of them - common sub-expression
elimination and copy propagation. What about constant propagation and constant folding?
Here is a very illustrative, nice example which shows both. We have a = 10, b = 20, and a
condition: if b == 20, go to B3, otherwise take the other branch. On the true branch
a = 30; then the paths join, d = a + 5, and we stop. Just observe that these are all
constants
which are known to the compiler. a = 10, so 10 is known at compile time; b = 20, and by
value numbering we know that b has the value 20 - very easy to find. Therefore, when
b == 20 is checked, the conditional constant propagation algorithm knows it is true,
because b already has the value 20. So the condition evaluates to true at compile time;
this is constant folding, we have evaluated the condition. As it is always true, the "no"
part can never be executed, so that edge can be removed and only the "yes" part remains;
the flow graph gets modified in this fashion. Now a gets the value 30, again a constant,
so we can replace a by 30. In the original flow graph we would not know whether a is 10 or
30 at the join point; but since the condition is always true, the false part never
executes, and a can only get its value from the assignment a = 30. So d = a + 5 becomes 35
and the program stops. In the modified program all the values have been computed; a + 5
computed as 35 is an example of constant folding.
So, b getting the value 20 and that value being used in checking b == 20 is an example of
constant propagation; the value of a being replaced by 30 is another example of constant
propagation. Later, we are going to study how exactly all this can be done.
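The example above can be hand-executed as a minimal Python sketch; the dictionary env standing in for the compiler's table of known constant values is an assumption of this sketch:

```python
# Hand-executed sketch of constant propagation and folding on the example:
# env records the constant value known for each variable at compile time.
env = {}

env["a"] = 10
env["b"] = 20
cond = (env["b"] == 20)       # propagation: b is known to be 20;
                              # folding: the condition folds to True
if cond:                      # always True, so the else-edge is dead
    env["a"] = 30             # at the join, a can only be 30
env["d"] = env["a"] + 5       # folding again: d = 30 + 5 = 35

print(env["d"])  # 35
```

A real conditional constant propagation pass does the same reasoning over the flow graph, marking the unreachable branch dead instead of executing anything.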
The next optimization that we are going to study is loop invariant code motion. Take this
simple example: t1 = 202, i = 1, then at the head of the loop t2 = i > 100; if t2, exit -
so we are going to execute this loop 100 times. Otherwise we fall through and decrement:
t1 = t1 - 2. Then we compute t3 = addr(a); addr(a) is a constant, the base address of array
a; then t4 = t3 - 4, which is an adjustment for indexing; then t5 = 4 * i, and t6 = t4 + t5.
Now t6 is the address of the element a[i] which we want to access; the store *t6 = t1
assigns the value t1 to that location; then i = i + 1 and we go back to L1. This is the
loop that we keep executing.
Let us look at some of the assignments. Obviously the value of i varies, so t2 = i > 100
varies with the loop; since t2 varies, the conditional jump on it cannot be evaluated once
and for all either. t1 = t1 - 2 also varies with the loop, so it cannot be evaluated once
and for all. But t3 = addr(a) is different: addr(a) is a constant. The same assignment gets
executed 100 times within this loop, but the value of t3 will always be the same.
So, it is possible to move this quadruple outside the loop; there is nothing wrong with
that. A computation which does not vary within the loop is called a loop invariant
computation, and moving such a computation outside the loop is loop invariant code motion.
The next one is t4 = t3 - 4; t3 is a constant, so t3 - 4 is also a constant, and this can
also be moved outside the loop - loop invariant code again. Once we do this, the loop
becomes smaller. The idea of all this is that if we remove statements from the loop, the
loop execution time becomes smaller and smaller. We want to make the loop as efficient as
possible, and that is the reason for loop invariant code motion: we try to move statements
whose operands do not get modified during the execution of the loop from inside the loop
to outside it.
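The transformation on this example can be sketched as two Python functions, one before and one after hoisting; the base address ADDR_A and the dictionary standing in for memory are assumptions of this sketch:

```python
# Sketch of loop-invariant code motion on the example: t3 = addr(a) and
# t4 = t3 - 4 do not change inside the loop, so they are hoisted into a
# preheader and the loop body shrinks.
ADDR_A = 1000                # assumed base address of array a

def before():
    t1, stores, i = 202, {}, 1
    while not (i > 100):
        t1 = t1 - 2
        t3 = ADDR_A          # loop-invariant: recomputed every iteration
        t4 = t3 - 4          # loop-invariant as well
        t5 = 4 * i
        stores[t4 + t5] = t1
        i = i + 1
    return stores

def after():
    t1, stores = 202, {}
    t3 = ADDR_A              # hoisted into the preheader
    t4 = t3 - 4              # executed once instead of 100 times
    i = 1
    while not (i > 100):
        t1 = t1 - 2
        t5 = 4 * i
        stores[t4 + t5] = t1
        i = i + 1
    return stores

print(before() == after())  # True: the transformation preserves behavior
```

The memory contents are identical either way; only the number of instructions executed inside the loop changes.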
The next optimization example we consider is strength reduction. Here is the code we had
before; loop invariant code motion has already been done, which is why the code is in this
form and those two statements are outside the loop. Inside, there is the statement
t5 = 4 * i, which actually uses a multiplication. On very small processors, and of course
in the olden days, multiplication used to take much more time than addition; for example,
floating point multiplication takes a long time, whereas fixed point multiplication takes
much less time, even today. This particular optimization is called strength reduction
because an operation which is expensive is converted to a number of less expensive
operations. Let us assume that multiplication is more expensive than addition; this
statement uses a multiplication.
Let us see what we have done: we have set t7 = 4 to begin with, inside the loop we have set
t7 = t7 + 4, and in t6, which was t4 + t5, we have used t4 + t7 instead. How is this
correct? Observe the values that t5 takes: with i = 1 it is 4, with i = 2 it is 8, then 12,
16, 20 and so on. In other words, the value of t5 is incremented by 4 in every iteration
of the loop: 4, 8, 12, 16, and so on. Instead of multiplying i by 4, why not simply take
another variable and add 4 to it each time? The effect is the same: with t7 = t7 + 4, the
values of t7 start with 4, just like t5 starts with 4, and then become 8, 12, 16, 20, just
like t5 does. There is no difference between the sequence of t5 values and the sequence of
t7 values, so in every sense we can replace t5 by t7, and that is precisely what we have
done.
In this process, we have replaced the more costly operation of star by the simpler and less costly operation of plus. Even today, on very small embedded processors, multiplication may still be more expensive than addition, because multiplication may typically be done by a subroutine whereas addition is a hardware instruction. That is strength reduction, and then of course we have also done a little bit of copy propagation: we would have actually set t5 equal to t7 here because they are equivalent, and then the value of t7 can be substituted for t5 in this particular statement. We have done these two, replaced t6 equal to t4 plus t5 by t6 equal to t4 plus t7, and removed the statement t5 equal to 4 star i. This is strength reduction; it requires a little bit of code for initialization and then perhaps even copy propagation.
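The transformation can be sketched in Python as follows; the temporary names t5 and t7 follow the lecture's three-address code, and the loop bound is an assumption for illustration. Both loops produce exactly the same sequence of values:

```python
def with_multiplication(n):
    """Before strength reduction: a multiplication in every iteration."""
    values = []
    for i in range(1, n + 1):
        t5 = 4 * i          # costly multiplication inside the loop
        values.append(t5)
    return values

def strength_reduced(n):
    """After strength reduction: a cheap addition replaces the multiply."""
    values = []
    t7 = 4                  # initialization code: t7 starts at 4 * 1
    for i in range(1, n + 1):
        values.append(t7)   # t7 plays the role of t5
        t7 = t7 + 4         # addition replaces the multiplication by 4
    return values

# The sequences 4, 8, 12, ..., 4n are identical
assert with_multiplication(100) == strength_reduced(100)
```

The running variable t7 must be initialized before the loop, which is the small amount of extra code the transformation pays for removing the multiplication.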
The next example is that of induction variable elimination; again we carry forward the previous example. Before the induction variable elimination, let us say i is the induction variable we want to eliminate. Why is i called an induction variable? Look at the values of i: they start with 1 and then go on in an arithmetic progression 1, 2, 3, 4, 5, 6, etc. All such variables which progress in arithmetic progression are called induction variables. The purpose of i in this particular loop is only to count the number of times the loop executes and then terminate the loop once it crosses the limit of 100.
If you look at the loop a little more carefully, there is another variable t7 which also progresses in an arithmetic progression. We start with 4 and then go to 8, 12, 16, 20, etc.; this is also an arithmetic progression. It is just that it is not the same as i, but 4 times the value of i. So, induction variable elimination tries to replace i with t7 and remove i; that is possible in this case. The only thing is that instead of i we must use t7, and once we use t7 the test i greater than 100 should become t7 greater than 400. With these changes it is possible to remove the induction variable i, change the test to use t7, and then the loop becomes even smaller. This is the loop that we get after induction variable elimination.
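A small Python sketch of this step, with the names i and t7 and the bound of 100 following the lecture's example: the counter i disappears, and the loop is controlled by t7 tested against the rewritten limit of 400.

```python
def with_counter():
    """Before elimination: i exists only to count iterations."""
    offsets = []
    i = 1
    t7 = 4
    while i <= 100:         # loop controlled by the counter i
        offsets.append(t7)
        i += 1
        t7 += 4
    return offsets

def counter_eliminated():
    """After elimination: the test on i becomes a test on t7."""
    offsets = []
    t7 = 4
    while t7 <= 400:        # i > 100 has become t7 > 400 (i.e. 4 * 100)
        offsets.append(t7)
        t7 += 4
    return offsets

assert with_counter() == counter_eliminated()
```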
Here is a bigger example of induction variable elimination and strength reduction; this is from the dragon book. Here are two loops and a third loop which encompasses both of them: this is the outer loop and these are the two inner loops. It is very easy to see that i is a counter variable here and j is a counter variable here, but it so happens that i is also a counter variable in the outer loop, and likewise j is a counter variable in the outer loop. So i is a basic induction variable, and t2 depends on i: it is 4 times i, as in the previous case; so if i goes as 1, 2, 3, 4, then t2 goes as 4, 8, 12, 16, etc. Such variables as t2, which depend on i, are called derived induction variables. There are conditions to be checked for basic and derived induction variables, which we will learn later. For the present it suffices to know that i is a basic induction variable both in this inner loop and in the outer loop, and t2 is a derived induction variable both in this loop and in the next loop.
The same is true of j. It is not necessary that the values of induction variables only increase; they can monotonically decrease also. j starts with n and reduces to n minus 1, n minus 2, etc., and t4 derives from j: t4 is 4 times j, and it reduces in the same proportion. So both of these are induction variables in the inner loop and also in the outer loop.
So, both i and j can be eliminated, and it is possible to introduce new variables; t2 can of course also be eliminated, and t4 can also be eliminated. These are replaced by two variables s2 and s4, with a suitable initialization carried out here; s2 becomes s2 plus 4 in this loop, and s4 becomes s4 minus 4 in this loop. In the test, which has i greater than or equal to j (this is the only place where i and j are used; they are used to control the loop), we use s2 and s4. Once we do this, it is possible to convert this program into this one: we replace i greater than or equal to j by a test on s2 and s4, and the variables s2 and s4 are incremented and decremented appropriately by 4. This is the program in which we have been able to do strength reduction here and here, and then we have eliminated the two induction variables i and j, replacing them by s2 and s4.
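The dragon-book example can be caricatured in Python as follows. Only the control variables are modeled (the loop bodies are elided), so this is just a sketch of how i, j, t2 and t4 are replaced by s2 and s4; the initializations are assumptions chosen to mimic the slide.

```python
def before_transformation(n):
    """i and j are basic induction variables; t2 = 4*i and t4 = 4*j
    are derived induction variables recomputed with multiplications."""
    steps = []
    i, j = 0, n + 1
    while True:
        i += 1
        t2 = 4 * i          # derived from i by multiplication
        j -= 1
        t4 = 4 * j          # derived from j by multiplication
        if i >= j:          # i and j are used only in this test
            break
        steps.append((t2, t4))
    return steps

def after_transformation(n):
    """i, j, t2, t4 are gone; s2 and s4 carry the same values."""
    steps = []
    s2, s4 = 0, 4 * (n + 1)  # suitable initialization before the loop
    while True:
        s2 += 4             # strength-reduced: addition, not 4 * i
        s4 -= 4             # strength-reduced: subtraction, not 4 * j
        if s2 >= s4:        # i >= j has become s2 >= s4
            break
        steps.append((s2, s4))
    return steps

assert before_transformation(10) == after_transformation(10)
```

The test s2 >= s4 is equivalent to i >= j because s2 and s4 are always exactly 4 times i and j respectively.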
The next optimization is a very important one called partial redundancy elimination. There are actually many optimizations which this particular optimization, PRE as it is called, subsumes. For example, loop invariant code motion gets subsumed in it. If we do it properly, then two or three types of common subexpression elimination also get subsumed by it; so two or three optimizations can all be subsumed by this one optimization itself.
Let us understand this. Here we have a equal to c and then x equal to a plus b; on the other path we have b equal to d, and after the merge we have y equal to a plus b. There is no computation of a plus b here, and there is no computation of a plus b here. If you had a plus b here also, for example some z equal to a plus b, then you would have detected that this a plus b is a common subexpression, because it is computed here and here, and the GCSE algorithm would have been applied. But now it is computed only along this path, and it is not computed along this other path before this particular computation. Such a computation is a partially redundant computation.
If we take this path, there is no need to do the computation twice; the value can be used directly here. If we take the other path, then it is necessary to do it. If we are able to detect that such partially redundant computations exist, that is, that a plus b is partially redundant at this point, then it can be computed here and used here. This can be done provided we introduce h equal to a plus b along this other path also; h equal to a plus b is introduced here. Now, what happens is that this particular a plus b becomes fully redundant with the introduction of this computation. That is why we have a equal to c, h equal to a plus b (h is the temporary variable, as in GCSE) and x equal to h; h equal to a plus b again here, and this becomes y equal to h. We have converted a partially redundant computation into a fully redundant computation by breaking this particular edge and introducing this computation of a plus b.
The challenge is in detecting that this is a partially redundant computation, then taking steps to convert it into a fully redundant computation, and then applying GCSE. This is a very interesting algorithm, which we are going to see later.
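The same control flow can be sketched in Python, with a flag cond selecting the path; the placeholder value of x on the second path is an assumption, since the slide does not show what happens to x there.

```python
def before_pre(cond, a, b, c, d):
    """a + b is computed on only one path before the merge point,
    so the computation of y below is *partially* redundant."""
    if cond:
        a = c
        x = a + b      # a + b computed along this path only
    else:
        b = d
        x = 0          # assumed placeholder; no a + b on this path
    y = a + b          # recomputed even when the first path already did it
    return x, y

def after_pre(cond, a, b, c, d):
    """Inserting h = a + b on the second path makes the later
    computation fully redundant, so GCSE-style reuse applies."""
    if cond:
        a = c
        h = a + b      # h is the temporary, as in GCSE
        x = h
    else:
        b = d
        h = a + b      # inserted computation on the other path
        x = 0
    y = h              # reuse h instead of recomputing a + b
    return x, y

for cond in (True, False):
    assert before_pre(cond, 1, 2, 3, 4) == after_pre(cond, 1, 2, 3, 4)
```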
Let us look at a simple optimization called unrolling a loop. Here is a simple loop: for i equal to 0, i less than N, i plus plus, and then there are two statements S1 and S2. We have put i in the parentheses here to indicate that these are the instances of the ith iteration.
If we unroll the loop, we are going to reduce the number of iterations. Let us say we have unrolled it thrice; that means there are going to be three instances of S1 and S2 in the body of the loop: S1(i); S2(i); S1(i+1); S2(i+1); S1(i+2) and S2(i+2). The loop is actually going to run a smaller number of times: it starts with 0, checks whether i plus 3 is less than N, and then i is incremented by 3, because we have already included the instances of S1 and S2 corresponding to i plus 1 and i plus 2 as well.
At the end, there may be some iterations which remain, because the number of iterations may not be a multiple of 3. Those leftover iterations cannot be executed by the unrolled body, which needs three full iterations at a time, so the rest of the iterations are all executed by this particular remainder loop which follows.
We start with k equal to i, k less than N, k plus plus, and then without any unrolling we execute S1(k) and S2(k). The advantage of loop unrolling is that it enlarges the body of the loop. This body is much larger than before: earlier it was just S1 and S2, whereas here we have many instances of S1 and S2, and because of this the other types of machine dependent optimizations, such as instruction scheduling, software pipelining, etc., become much more effective. We will learn about these much later, but it suffices to say that this is a very effective way of enabling other machine dependent optimizations.
So, with this we will stop this lecture, and in the next lecture we will continue with other examples of optimizations. Thank you.