Tip:
Highlight text to annotate it
X
Today
let us consider designing of another pipeline architecture. You know that if we want to
multiply two matrices, let us assume that we have two matrices A and B and we want to
multiply the two matrices to generate a third matrix C.
The matrix multiplication is multiplying two matrices A and B to generate the third matrix
C. We assume that this matrices A and B both are of dimension three. That is they have
three number of rows and three number of columns. So for matrices A the elements are a11 a12
a13 a21 a22 a23 a31 a32 a33. This is matrix A. Similarly matrix B is also of dimension
three given by b11 b12 b13 b21 b22 b23 b31 b32 and b33 and you all know that if we want
to multiply these 2 matrices, our resultant matrix will be given by, every ijth element
in matrix C is given by aik multiplied with bkj sum of k equal to 1 2 3. If we write a
program for creating this matrix C, the program will simply look like this, that for i equal
to 1 to 3 do, for j, aik. So it is cij equal to cij plus aik into bkj and before this,
at this point I have to initialize cij to zero. So I need such a for loop to compute
this matrix C and you find that here, the complexity of this matrix multiplication root
in will be 3 into 3 into 3, that is 27 number of operations. Now let us say whether we can
design a pipeline architecture so that I can multiply this matrix in less amount of time.
So for this pipeline architecture I assume that every element in the pipeline architecture
is having a function like this. It has got three inputs, inputs a b and c and it provides
three outputs. In the horizontal and vertical directions whatever is the input, the same
thing comes as output. So here the output will be a, here the output will be b, whereas
in the diagonal direction the output will be c plus a into b. So this is the processing
element using which we will try to implement, try to design a pipelines architecture so
that I can multiply this 3 by 3 matrices. So the architecture will be something like
this. So as we have said that every such processing element will have three inputs, one we represent
as horizontal input, the other one as vertical input and we have a diagonal input. So it
is like this, the inter connection will be met of this form.
So this is the type of inter connection that we will have and you will feed the data on
this input sites. For feeding the data we will assume that during time t1, the data
that we feed to this vertical input or like this 0 0 a13 a12 and a11. So that is one row
of matrix a, during the same thing t1 we have to feed in one column of matrix b to this
horizontal inputs. So those are b11 b21 b31 and these inputs are zero. During time t2
I feed the second row of matrix a in this vertical input. So that will be 0 a21 a22
a23 and this input is 0 and during same time t2, I will feed the second column of matrix b to
this horizontal input. So this will be 0 b12 b22 b32 and this input will be zero.
Similarly during time t3 I feed the third row of matrix a to this vertical input and
this becomes a31 a32 a33 and during the same time t3, I feed in the third column of matrix
b to this horizontal input. So this input becomes 0 0 b13 b23 and b33 and these are
actually delay elements which are shown by rectangles or latches. Now given this type
of diagram, let us see how this architecture works. You find that since each of these processing
elements are of this type, that means whenever you feed in a as horizontal input, b as vertical
input and c as the diagonal input, this will output a in the horizontal direction. The
same thing is passed out, it will output b in the vertical direction whereas in the diagonal
direction it will give c plus a into b.
So if we assume this, you find that… in addition to this diagonal inputs are always
zero, these are always zero. So you find that since at time t1, we are feeding one row of
matrix a in the vertical direction and feeding one column of matrix b in the horizontal direction.
So once this data are available, what is the output of this processing node? That is 0
plus a11 into b11. So at this point I will have the output as a11 into b11. What I will
have here? Simply a11 and in this direction I will have simply b11.
Similarly after same time t1 what I will have in this direction? Simply a12 will be passed
here in the vertical direction because this is the vertical input, b21 which is fed horizontally
that will be passed here. So what I get during t2 is during t2 here I am feeding a21 but
during t2, a12 is already available. Similarly during t2 this b21 is already available here.
The next data that I am feeding is b12. So after time instant t2 what I will get here?
Here I have a12 available here. I have b21 available and in this diagonal direction I
have a11 into b11. So in this diagonal output, I will have a11 into b11 plus a12 which is
already available on this link into b21 which is already available on this link.
So here I will have a11 into b11 plus a12 into b21 available on this link after time
t2. Similarly after time t3 what I will get here is a11 into b11 plus a12 into b21 plus
a13 into b31 that will be available here. That is after time t3 and simultaneously because
this data is flowing through this pipeline, all other components will also be generated.
So if you analyze this way, you will find that the output will be something like this.
You will get starting from time t6 onwards, so I will put it like this. After t5 that
is t6, during time t6 what I will get here? c11; c11 will be outputted by this. That is
after time t5 that is during time t6, I will get c11 here. At the same time, during time
t6, I will get c21 which will be outputted by this node in the diagonal direction. During
the same time t6, I will get c31 which will be outputted by this node.
During same time t6, this node will output c12 and this node will output c13. During
time t7 this node will output c32, this node will output c22 and this node will output
c23 and this output I will get during time t7 and during time t8, this node will output
c33. How the data will flow? If you analyze that, you will find that after time t6 simultaneously
I get c11 c12 c13 c21 and c31. During time t7, I get c22 c23 and c32 and during time
t8, I get c33. So as against in this case where we have said that for this simple multiplication,
matrix multiplication I need 27 operations. The same output is obtained in just 8 clock
periods. So this gives another example how pipelining can reduce the computation time
and this sort of pipeline has a specific name systolic array, you know that. These two types
of pipeline architectures that we have discussed, one for floating point addition and the other
for this matrix multiplication, you will find that these kinds of pipelines are dedicated
pipelines. That means you can perform only one operation with this pipelines but it is
ensured that operation is more efficient.
We can have other kinds of pipeline as well which are called generalized pipelines. What
is the generalized pipelines? In both these kinds of pipelines, you have seen that the
data flows only in one directions. That is from stage i, the data goes to the computation
stage i plus 1. I don’t have any provision of giving the data to a stage j where j is
greater than i plus 1. From stage i, I cannot give the data to a stage j where j is greater
than j plus 1 or j is less than i. That means I don’t have any provision of feed forward
data flow or I don’t have provision of feedback data flow. So in case of generalized pipeline
we can have the provision of both feed forward as well as feedback data flow.
So in this case of pipeline what we have done is or if you remember the earlier pipeline
that we have done for floating point addition, we have always said that the data enters from
one end of the pipeline, it comes out of the other end of the pipeline. From stage s1 the
data always goes to stage two. From stage two it always goes to stage three, from three
it always goes to four. I don’t have any facility of sending the data from stage one
to stage three. I am always sending the data from stage one to stage two or I don’t have
any facility of giving the data from say stage three to stage one. That is I don’t have
any provision of feedback connection. Here in this case from this stage, the data is
always moving in this direction. From here the data always goes here, the data always
goes here that means it always goes in the forward direction following a linear path.
I cannot bypass of any this. I don’t have any provision of sending the data from this
processing element to this processing element directly. I always have go to where this processing
element. So this is a kind of pipeline which is also known as linear pipeline. The data
always flows in the linear direction.
In a generalized pipeline what we will see is from any stage i, I can send the data to
any other stage j where j can be greater than i plus 1. In case of linear pipeline, j is
always equal to i plus 1. form i, I can send the data to j where j is always i plus 1 equal
to i plus 1 in case of linear pipeline. In case of a generalize pipeline, j can be greater
than i plus 1 that means I can bypass some of the stages in between and feed forward
the data or j can also be less than i that means I can send the data in the backward
direction in the pipeline. So such a generalize pipeline can have a structure like this. Suppose
we have got three spaced pipeline s1, s2 and s3. If it is a linear pipeline, always data
will move from s1 to s2, s2 to s3 and finally you get the output.
Now if it is not a linear pipeline and we have also said that to take care of the speed
mismatch, we can put latches in between stages. So let us also put the latches. We will have
one latch here, one latch here, one latch here and finally one latch will come here.
From the latch, the data always go to the corresponding stage. I will have an input,
this stage will give an output, this will also give an output, this will also give an
output but now I have to have the provision of selecting the source of data. So as I said
that in this generalize pipeline, in case of linear pipeline the data will always move
from s1 to s2. So this output of s1 can be directly connected to the input of this latch.
Similarly the data will flow from s2 to s3, so this output of s2 will be directly connected
to the input to this latch but now in case of generalized pipeline, I can also have a
provision that from s1 I should be able to send the data to s3 bypassing s2.
Similarly from output of s3 I should be able to send the data to s2, I should also be able
to send the data to s1 which will be the feedback connection. If I send the data from s1 to
s3 that becomes a feed forward connection. from s1 to s2 or s2 to s3 that becomes a linear
connection. So all these three types of connections should be possible in a generalized pipeline.
so because now the source of data to every stage can be one of many, so I should put
some sort of multiplexer using which I can select the source of data to the input of
every latch. So what I can do is I can put a multiplexer here. One input of this multiplexer
is the input data and output of the multiplexer is coming to the input of this latch. The
other inputs to this multiplexer can be from output of s3.
Similarly I will also put a multiplexer here that is input to s2 can also come from the
output of s3. So like this and similarly s3 can get input from s2, it can get input from
s1. It can also get input from s3 itself that is the self-loop. So in this case I can do
a connection of this form s1, the data can go to s3, the data can come to s3 from s3
itself. I can have data path something like this and each of these are multiplexers. This
is a multiplexer, this is also a multiplexer, this is also a multiplexer. So you find that
by properly selecting the select inputs of the multiplexers, I can send the data from
output of s1 to the input of s3. I can send the data from output of s3 to the input of
s1, I can also send the data from output of s3 to the input of s2. I can also send the
data from output of s3 to s3 itself.
So I have a reconfigurable type of pipeline. Depending upon what is the function requirement,
I can choose these multiplexer select inputs so that the data will be routed through this
pipeline as required. For functions I can also perform, this can be used for computing
a single function. It may also be used for computing multiple function. That is more
than one functions can be computed by the same pipeline. So if I want to compute more
than one functions then suppose I can take an output from this stage and this gives me
a function A. If I take output from this point, it gives me a function B. So there may be
other connection as well. For example in this case we have not shown a feedback connection
from s2 to s1. That is also possible. With the help of this, I can only feedback the
data from output of s3 to s1, I can feedback of data from output of s3 to s2, I can also
feedback from s3 to s3 itself. I can also have a feedback connection from s2 to s1 or
s2 to s2 itself that is also possible in a generalized pipeline.
Now given such a generalized pipeline, now we have to decide that because now it is reconfigurable.
I have to decide that how the data path should be selected, how the data should flow to compute
a particular function which is not required in case of a linear pipeline because in case
of linear pipeline, the data flow is always fixed. From stage i it will go to stage i
plus 1 but in this case because there can be various ways in which the data can flow
from one stage to another, I had to specify for computation of a particular function,
how the data path should be selected. So for computing any function, I have to specify
what is called a reservation table. That means what are the time steps, what are the time
periods during which, which of these computation stages will be used for computation of a particular
function.
So reservation table can be specified again in the form of a space time diagram. Say for
example for function A, for the computation of function A my reservation table can look
like this. There are three computation stages s1 and s2 and s3 and these are the different
time steps or time periods t0, t1, t2, t3, t4, t5, t6, t7. So when I specify the reservation
table for function A, what I have to specify is during which of these time periods which
of the computation stages will be used by this function A. So I can have a reservation
table something like this. I can put a cross here indicating that during time interval
t0, it is the computation stage s1 that will be used by this function. Similarly I can
put a cross here, I can put a cross here. For s2 it can be something like this, for
s3 it can be something like this.
Similarly to compute the function B, function B will also have a reservation table and that
reservation table will obviously be different from this. If it is not different from this
then obviously function A and function B are same. So for function B, the reservation table
can be specified. Again I have the computation stages s1 s2 s3, I have the time durations
t0 t1 t2 t3 t4 t5 t6, so something like this. So let us see how the data path will be set
for computation of these two functions.
If the tables are same, output point is also same. Say depends upon during the last step,
last time period which of the computation stages has been used? You will find that function
for A, it takes 8 different time steps starting from t0 to t7. During t7 the computation block
that is used is s2. That means for function A, s2 is the output point. For function B
at the last time step t6, the computational block that is used is s3 that means for function
B, s3 is the output point. For function A the output point is s2, for function B the
output point is s3 so that is what has been shown in this pipeline diagram also.
I have taken output A, function A from the output of s2 whereas function B has been taken
from output of s3. So as this reservation table says, it specifies that during which
time step, during which time period which of the computational blocks are to be used
for computation of a particular function. So here you find that for computation of say
A during t0, the computation block that is used is s1 and this being the first one, the
input obviously has to come from the external input. That means the data path that has to
be active during t0 is the direct data path from the external input to this latch and
from the latch it directly goes to computation stage s1.
Then you find that during t1, the computation block that is used is s2, during t2 the computation
block is that is used s3. So these follows a linear path. At least for this first three
timing durations, the data flow is linear. From s1 the data will go to s2, from s2 the
data will go to s3 but after that during time interval t3, the processing block that is
being used is s1 and prior to that, we had the output available from processing stage
s3. That means now I have to feedback the data from the output of s3 to the input of
s1. So at this point the path that is to be active is this one and I have to set the multiplexer
select input in such a way that now instead of taking the data from the external input,
it has to take the data from this feedback path. From there data will come to s1 and
the corresponding computation will be performed.
After s1 during time step t4, you find the computational block that is to be used is
block s3. So now it does not follow the linear path anymore. In case of linear path here
from t0 to t1 we had moved the data from s1 to s2 but now we have to move the data from
s1 to s3. That means I have to make use of the feed forward path and this is the feed
forward path. So you have to bypass s2 and set the select input of this multiplexer in
such a way that the data from this input will be selected and that will be fed to the computational
block s3 for this part of computation. This way the computation will continue and at the
end of t7, you get the output from the computation stage s2.
Similarly we can also find out how the data flow will be performed for this function B.
So it is this reservation table which specifies that how the data will flow through the pipeline
and how many times it will be fed back or feed forward to compute a particular function.
Obviously the pipeline control circuit which decides that how the multiplexer select input
will be decided or how the data will be latched into different latches that will make use
of this reservation table because it is only through the reservation table, the control
circuit will know how the data has to move or how many times steps will be needed for
computation of a particular function. This reservation table is very important for a
generalized pipeline. They will have different delays. They will generate same delay. No.
Blocks can have different delays but the final delay, I mean how much delay is allowed that
is decided by the control circuit. That is the control circuit will which will decide
that when to latch the data from one stage to the latch and that takes care of what is
the maximum delay in different stages.
At a particular time step only one block is working for these functions but it is not
necessary I can have a situation like this as well. This is also possible in a reservation
table. That means for computation of function A during time step t3, you are making use
of both the computation stage s1 and stage s3. So if you do this, you will find that
during time step t2, s3 was used. So after time step t2 the output is available for s3.
During this time both s1 and s3 are being used. That means I have to set the multiplexer
in such a way that is that output of s3 goes to both s1 as well as s3 itself. To s3 itself
is going to s1 also. Data output of s1 this is leaving an output okay then after that
the stage which is used is s3. You are talking about this multiplexer, may be some more function
will be generated here. That is also possible instead of simple multiplexer I can have a
functional block.
So multiplexer will be a part of that functionality also or may be before s3 I have one computation
stage which will combine these two outputs, output of s three and output of s1. What I
have shown is a very simplified situation but this can be complicated. So in case of
generalized pipeline, it is possible that more than one computation stage will be used
during the same instant of time. That can be done. Not only that, in a complicated situation
it is also possible that the pipeline is being used for computation of more than one functions
simultaneously, that is also possible. But how to do it that you have to find out. That
has to be found out by analysis of the reservation table.
Now when I come to this generalized pipeline, here you find in case of linear pipeline I
didn’t have any problem. At every time step I can feed in a new data, a new task to the
pipeline but when I come to this generalized pipeline that is not possible. By studying
this reservation table, I have to find that what are the time steps, when I can feed in
a new task to the pipeline. I cannot feed in a new task to the pipeline at every time
step because in that case there can be data clash how? See for example here, for computation
of a particular function f A, function A the pipeline makes use of this computation block
s one during time instant t8. It makes use of the same computation stage s1 during time
instant t three. It also makes use of the same computation stage s1 during time instant
t6.
So if at t3, during t3 I want to feed in a new task, a new set of data but on which the
same function A has to be computed. Then you will find that this t3 becomes t0 for that
new task. So for the new task s1 has to be used simultaneously, when s1 is also being
used by the previous task which is not possible. That means two task at time to use the same
processing stage s1 at the same instant of time. So obviously there will be clash. So
that is what is called pipeline scheduling. I have to decide, when I can schedule a new
task to the pipeline. I cannot schedule a new task to the generalized pipeline every
time which I can do in case of a linear pipeline because in case of a linear pipeline, I know
that the data enters one end of the pipeline, it goes out of the other end of the pipeline
and in between it always has a linear flow. That means from s1 it will always come to
s2.
For the same function, s1 will not be used again. So at every time step, I can give a
new task to a linear pipeline. I don’t have to consider about this collision. I don’t
have to bother about this collision but in case of a generalized pipeline, I have to
find out that what are the time durations or what are the time steps, when I can feed
in a new task to the pipeline. That I have to do by analysis of this reservation table.
So now let us see how we can analyze this reservation table.
So one thing I have said that here if I initiate a task at t0, I cannot initiate next task
at time instant t3 because then there will be clash for this process stabilized one.
Neither I can initiate a task at time instant t6 because then also there will be a clash
on s1. Similarly here I cannot initiate at a task at t1 sorry. If at some instant of
time at t1, this function A makes use of this processing stage s2, I cannot initiate a task
in such a way that the next task will make use of the same processing stage s2 during
time instant t7 because then there will be a clash in this processing stage s2. So I
have to analysis this reservation table so that at no computation stage, there is a clash.
So only when I know that by initiating a task, it does not lead to any clash I can initiate
a new task at that time instant. That yet can be founded only by analysis of this reservation
table. That we will do in the next part.