The following content is provided under a Creative
Commons license.
Your support will help MIT OpenCourseWare continue to
offer high quality educational resources for free.
To make a donation or view additional materials from
hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PROFESSOR: So let's get started with the second
lecture for today.
So I guess one thing multicores did is really
shatter this nice view of writing your programs and having the
hardware take care of giving you performance.
So hardware just kind of completely gave that up.
But what you're doing in this class is you're trying
to do it all by yourself.
It gives all the responsibility back to the programmer.
And you realize as you go, it's a much harder job.
I mean, this is not simple programming.
But you don't have MIT students at
every company to do this, so we need to have some kind of
middle ground.
And so some of the stuff we have been doing is trying to
figure out whether there is any middle ground.
Can you actually take some of that load away from the user
into things like languages and compilers.
So we will talk about some of those.
So right now we are kind of switching from directly doing
what's necessary to do the Cell project, into going for
breadth. So this lecture, and then we will sit back and do a
little bit of debugging and performance work and that will
be directly helpful.
And then next week we'll have lots of guest lectures to kind
of give you breadth in there.
So you'll understand not just Cell programming but parallel
programming and parallel processing, what the world is
like beyond that.
So today we're going to have Bill talk about streams.
BILL THIES: OK very good.
So my name is Bill Thies.
I'm a graduate student working with Saman and Rodric, and
others here.
And I'll talk about the StreamIt language.
So why do we need a new programming language?
Well we think that languages haven't kept up with the
architectures.
So one way to look at this is that if you look back at
previous languages, look at C with the von Neumann machine.
Now I grew up in rural Pennsylvania not too far from
Amish Country.
And so to me these go together just like a horse and buggy.
OK they're perfectly made for each other.
They basically go at the same rate.
Everything is fine.
But the problem is, in comes the modern architecture.
OK this is an F-16.
You have a lot more that you can do with it than with
the horse and buggy.
So how do you program these new architectures?
Well architecture makers these days are basically faced with
a really hard choice.
On the one hand, you could get a really cool architecture and
develop an ad hoc programming technique where you're really
just leaving it to the programmer to do something
complicated to get performance.
And unfortunately I think that's the
route that they took.
I mean fortunately for the industry, but unfortunately
for you, I think that's the route they took with cell
which means all of you are going to become basically
fighter pilots.
You have to learn how to fly the plane.
You have to become an expert.
You're going to become the best people at programming
these architectures.
And unfortunately the only other option is to really bend
over backwards to support the previous era of languages
like C and C++.
And you can see what's coming here, it's just hard to get
off the runway.
So you don't want this situation.
Out of consideration for whoever is in the buggy,
hopefully you'll never take off.
So looking from a more academic perspective, why do
we need a new language right now?
So if you look back over the past 30 years, I know you've
seen this graph before.
We were dealing with just one core in the machine
for all this time.
And now we have this plethora of multicores
coming across the board.
So how did we program these old machines?
Well we had languages like C and FORTRAN that really have a
lot of nice properties across these architectures.
So it was portable, high-performance, composable--
you could have really good software development--
malleable, maintainable, all the nice things you'd like to
see from a software engineering perspective.
And really if you wrote a program back in 1970, you
could keep it in C and have it continue to leverage all the
new properties of machines over the past 30 years.
So it just ran fine out of the box.
And looking forward, that's not going to be true.
So for example, we could say that C was the
common machine language.
That's what we say for the past 30 years, was common
across all the machines.
But now looking forward, that's not
going to be true anymore.
Because you have to program every core separately.
So what's the common machine language for multicores?
We really think you need something where you can write
a program once today, and have it scale for the next 30 years
without having to modify the program.
So what kind of language do you need to get that kind of
performance?
Well let's look a little deeper into this notion of a
common machine language.
So why did it work so well for the past 30 years?
Well on uniprocessors, things like C and FORTRAN really
encapsulated the common properties.
So things like a single flow of control in the machine, a
single memory image, are both properties of the language.
But they also hid certain properties from the
programmer.
So they hid the things that were different between one
machine and another.
So for example, the register file, the ISA, the functional
units and so on.
These things could change from one architecture to another.
And you didn't have to change your program because those
aspects weren't in the programming language.
So that's why these languages were succeeding.
And what do we need to succeed in the multicore era from a
language perspective?
Well you need to encapsulate the common properties again.
And this time it's multiple flows of control that you have
for all the different cores, and multiple local memories.
There's no more monolithic memory anymore, that everyone
can read and write to.
Also you need to hide some of the
differences between the machines.
So some cores have different capabilities.
On Cell there's a heterogeneous system between
the SPEs and the PPE.
Different communication models on different architectures,
different synchronization models.
So whatever common machine language we come up with,
we'll have to keep these things hidden from the
programmer.
Now a lot of different researchers are taking
different tacks for how you want to invent the next common
machine language.
And the thrust that we're really excited about is this
notion of streaming.
So what is a stream program?
Well if you look at a lot of the high-performance systems
today-- including Powerpoint which is running this awesome
animation--
you can basically see that they're based around some
stream of data.
So audio, video, like HDTV, video editing, graphic stuff.
I think actually, a lot of the projects in this class that I
looked at, would fit into the streaming mold.
Things like the software radio, ray tracing, I
probably don't remember them all.
But when I looked at them, they all looked like they had
a streaming component somewhere in there.
So what's special about a stream program compared to
just a normal program?
Well they have a lot of attractive properties.
If you look at their structure, you can usually see
that the computation pattern remains relatively constant
across the lifetime of the program.
So they have some well-defined units that are communicating
with each other.
And they continue that pattern of communication throughout.
And this really exposes a lot of opportunities for the
compiler to do some optimizations that it couldn't
do on just an arbitrary general purpose program.
And if you saw before, we have basically all the types of
parallelism are really exposed in a stream program.
There's the pipeline parallelism between different
producers and consumers.
There's the task parallelism basically going
from left to right.
And also data parallelism which means that a single one
of these stages can sometimes be split to apply to multiple
elements in the data stream.
So when you're thinking about stream programming, there's a
lot of different ways you can actually
represent the program.
So whenever you have a programming model, you have to
answer these kinds of questions.
For example do the senders and the receivers block when they
try to communicate?
How much buffering is allowed?
Is the computation deterministic?
What kind of model do you have in there?
Can you avoid deadlock?
Questions like these, and we could spend a whole lecture
answering these questions, putting them in different
categories.
But what I want to do, just to give you a feel, is
touch on kind of three of the major models that you might
see come up in different kinds of programming models.
And I'll just touch on Kahn process networks, synchronous
dataflow, and communicating sequential processes, or CSP.
So just one slide on these models.
So let's compare them a little bit.
First there's the Kahn process networks.
So this is kind of the simplest model.
It's very intuitive.
You just have different processes that are
communicating over FIFOs.
And the FIFO size is conceptually unbounded.
So to a first approximation, it's kind of like a Unix pipe.
These processes can just push onto their outputs
without blocking.
But if they try to read from an input they do block until
an input is available.
And the interesting thing is that the communication pattern
can actually be dependent on the data.
So for example I could pop an index off of one channel, and
then use that index to determine which other channel
I'll read from on the next time step.
But at the same time it is deterministic.
So for a given series of input values on the stream, I'll
always have the same communication pattern that I'm
reading from the inputs.
So if it's a deterministic model, that's a nice property.
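The data-dependent but deterministic routing described above can be sketched in a few lines of Python. This is not StreamIt or any real KPN runtime, just an illustration with made-up names: a node pops an index from a control channel and uses it to decide which data channel to read next, yet for a fixed input the output is always the same.

```python
from collections import deque

def data_dependent_merge(control, channels):
    """For each index popped from `control`, pop one value from
    the data channel it names. Blocking reads are modeled by
    plain sequential iteration over already-available data."""
    out = []
    for idx in control:
        out.append(channels[idx].popleft())
    return out

control = [0, 1, 1, 0]  # which channel to read at each step
channels = {0: deque([10, 20]), 1: deque([30, 40])}
print(data_dependent_merge(control, channels))  # → [10, 30, 40, 20]
```

Because the routing depends on data, a compiler cannot know the communication pattern in advance; that is exactly the freedom synchronous dataflow gives up.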
Let's see, what else to say here?
There's actually a few recent ventures that are using Kahn
process networks.
So there's commercial interest. For example Ambric
is a startup that I think will be based on a Kahn process
network for the programming model.
Looking at another model called synchronous dataflow,
this is actually what we use in the StreamIt system.
And compared to Kahn process networks,
it's kind of a subset.
It's a little bit more restrictive.
So if you look at the space of all possible program
behaviors, Kahn process networks are a pretty big
piece of the space.
And then synchronous dataflow is kind of a subset of that
space where you know more about the communication
pattern at compile time.
So for example, in synchronous dataflow, each filter
actually declares how many items it will consume from
each of its input channels on a given execution step.
So there's no more data dependence regarding the
communication pattern.
It'll always input some items from some of the channels and
produce some number of items to other channels.
And this is a really nice property because it lets the
compiler do the scheduling for you.
So the compiler can see who's communicating to whom and in
exactly what pattern.
And it can statically interleave the filters to
guarantee that everyone has enough data to complete their
computation.
So there's a lot of interesting optimizations you
can do here.
That's why it's very attractive for StreamIt.
And you can statically guarantee freedom from
deadlock, which is a nice property to have.
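To make the static-scheduling point concrete, here is a small Python sketch (not the actual StreamIt scheduler) of the classic SDF "balance equation" for a hypothetical two-filter pipeline A -> B: given the declared push and pop rates, the compiler can solve at compile time how many times each filter must fire per steady state so that no one starves and no buffer grows without bound.

```python
from math import gcd

def steady_state_reps(push_rate, pop_rate):
    """Solve push_rate * reps_A == pop_rate * reps_B for the
    smallest positive repetition counts (reps_A, reps_B)."""
    items = push_rate * pop_rate // gcd(push_rate, pop_rate)  # lcm of rates
    return items // push_rate, items // pop_rate

# A pushes 2 items per firing, B pops 3 per firing:
print(steady_state_reps(2, 3))  # → (3, 2): fire A three times, B twice
```

The same idea extends to whole graphs by solving one such equation per channel, which is why deadlock freedom and buffer bounds can be guaranteed statically.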
The last one I want to touch on is communicating sequential
processes or CSP.
And in the space of program behaviors, it kind of
overlaps with Kahn process networks, and adds
a few new semantic behaviors.
So the buffering model is basically rendezvous
communication now.
So there's no buffering in the system.
Basically anytime you send a value to another process, you
have to block and wait until that process will actually
receive that value from you.
So everyone is rendezvousing at every communication step.
In addition to that, they have some sophisticated
synchronization primitives.
So you can for example, discuss alternative behaviors
that you have. You can either do one thing or another, which
introduces nondeterminism into the model.
Which could be a good or a bad thing depending on the program
you're trying to express.
And pretty much the most well-known encapsulation of
CSP is this occam programming language invented
quite a while ago.
And some people are still using that today.
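The rendezvous idea can be sketched in Python with threads. This is not occam or any real CSP library; it is a toy channel, built here from two unbuffered handoffs, where `send` blocks until the receiver has actually taken the value.

```python
import threading, queue

class RendezvousChannel:
    """Unbuffered CSP-style channel: send blocks until recv runs."""
    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, value):
        self._data.put(value)
        self._ack.get()          # block until the receiver took it

    def recv(self):
        value = self._data.get()
        self._ack.put(None)      # release the blocked sender
        return value

ch = RendezvousChannel()
received = []
sender = threading.Thread(target=lambda: [ch.send(v) for v in (1, 2, 3)])
sender.start()
for _ in range(3):
    received.append(ch.recv())
sender.join()
print(received)  # → [1, 2, 3]
```

Contrast this with the Kahn model, where a sender can run arbitrarily far ahead of its receiver because the FIFO is conceptually unbounded.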
Any questions on the models of computation?
OK.
So now let me get into what StreamIt is.
So StreamIt is a great language.
It's a high-level
architecture-independent language.
Oh, question in the back.
AUDIENCE: With the CSP I'm trying to understand exactly
what that means or how it's different.
Is what it's basically saying that you have a bunch of
processes and they can send messages to each other?
BILL THIES: So all these models have that property.
AUDIENCE: They all fit into that.
But it seems like from your explanation of CSP, that that
was just sort of the essence of CSP. Is it more specific?
BILL THIES: So CSP is usually associated with rendezvous
communication.
That's the part of it that fits inside
Kahn process networks.
It's any communicating model where you basically have no
buffering between the processes.
Now the piece that sits outside is usually lumped with
CSP, or especially with occam. They have a set of primitives
that are richer in terms of synchronization.
So for example, you can have guards on your communication.
Don't execute this consumption from this channel until I see
a certain value.
So there's some more rich semantics there.
And so that's the things that are usually outside.
They're outside the other models.
Does that make sense?
Other questions?
OK so StreamIt.
OK so StreamIt is architecture-independent.
It's basically a really nice syntactic model for
interfacing with these lower level models of computation
for streaming.
And really we have two goals in the StreamIt project.
And the first is from the programmer's side.
So we want to improve the programmer's life when you're
writing a parallel program.
We want to make it easier for you to write a parallel
program than it would be in C or a language like
Java, or any other language that you know.
And at the same time, we want scalable and portable
performance across the multicores.
So an interesting thing you'll find these days is that it's
often very hard to tempt the programmer to switch to your
favorite language based solely on performance.
Or at least this has been the story in the past. It may
change, looking forward.
Because it's a lot harder to get performance these days.
But usually you have to offer them some other carrot to get
them on board.
And you know the carrot here is that it's really nice to
program in.
It's fun to program in.
It's beautiful.
It's a lot easier to program in StreamIt than it would be
in something like C or Java for a certain class of
programs. So that's how we get them on board, and then we
also provide the performance.
We're mostly based on the synchronous dataflow model.
In that when there are static communication patterns, we
leverage that from the compiler side.
So I'll also tell you about some dynamic extensions that
we have, that give a much richer model of communication.
So what have we been doing in the StreamIt project?
We have kind of a dual thrust within our group building on
this language.
So the first thrust is from the programmability side,
looking at applications and programmability. What can we
fit into the streaming model?
And we're also really pushing the optimizations.
So what can you do from both a domain specific optimization
standpoint, as kind of emulating a DSP engineer or a
signal processing expert in the design flow.
And also architecture specific optimizations.
So we've been compiling for a lot of parallel machines.
And we were hoping we could have a full system for you
guys this IAP, so you could write it in StreamIt and
then hit the button.
And it would work the whole way down to Cell.
Unfortunately we're not quite there yet.
But we do have a pretty robust compiler infrastructure.
And you can download this off the web and play with it if
you want to.
One of our backends that we've released so far actually does
go to a cluster of workstations.
So it's kind of an MPI-like version of C. It uses Pthreads
for the parallelism model.
And I mean, depending on what kind of a hacker you are, you
actually might be able to lower that down onto cell.
So some of this stuff you might be able to use if you
have some initiative in there.
And of course we'd be willing to work with
you on this as well.
So we have lots of optimizations in the tool flow.
And actually Saman will spend another lecture focusing on
the StreamIt compiler, and how we get
performance out of the model.
OK, so let's just jump right in and do the analog of Hello
World in StreamIt.
I'm going to kind of walk you through the language and show
you the interesting pieces from an intellectual
standpoint.
What's interesting about a streaming model that
you can take away.
So instead of Hello World, we have a counter.
Since we're dealing with stream programs here, you're
not usually doing text processing.
So how do you write counter?
Well there are two pieces to the program.
The first is kind of the interconnect between the
different components.
That's what we have up here.
We're saying the program is a pipeline with two stages, it
has a source, and it has a printer.
And then we can write the source and
the printer as filters.
We call those basic building blocks filters in StreamIt.
So the source will just have a variable x that it
initializes to zero.
And then we have a work function which is
automatically called by our runtime system every time
through the steady state.
So this work function will push one item onto the output
channel, and it'll increment the value afterward.
Whereas the intPrinter at the bottom here,
will input one value.
And its work function here just pops that value off the
input tape, and prints it to the output.
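The counter pipeline just described can be sketched in Python (this is not StreamIt syntax; the generator plays the role of the IntSource work function, and the driver loop plays the runtime that fires each filter in the steady state).

```python
def int_source():
    """IntSource: each firing pushes x, then increments it."""
    x = 0
    while True:
        yield x      # push(x++): the old value goes out first
        x += 1

def run_pipeline(iterations):
    """Drive the source -> printer pipeline for N steady-state
    iterations; IntPrinter pops one value per firing."""
    src = int_source()
    out = []
    for _ in range(iterations):
        out.append(next(src))
    return out

print(run_pipeline(4))  # → [0, 1, 2, 3]
```

The key point is that there is no main function over the data: the runtime repeatedly fires the work functions, and the number of iterations is chosen when you run the binary.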
Now how do we run this thing?
Well there's no main function here like you
see in Hello World.
Is there a comment?
Oh, sorry.
AUDIENCE: The two meanings of push and two meanings of pop.
BILL THIES: Two meanings?
AUDIENCE: Push 1, 2, 3.
BILL THIES: Yeah, yeah.
So the first push here is just declaring that this work
function will push one item to the output tape.
So this is the synchronous dataflow aspect.
We're associating an output rate with this work function.
So that's a declaration here.
Whereas this push is just actually executing the push
onto the output.
So how do we run this thing?
Well we compile it with the StreamIt compiler, strc,
into a binary.
And then when we run, we run for a given number of
iterations.
So you don't just call it once.
Our model is that this is a continuous stream of data
going through the program.
And so when you run it, you run it for some number of
iterations, or basically input or output items, is what
you're running it for.
So if you run this for four iterations, it would
print 0, 1, 2, 3.
So we can leverage this steady flow of data.
Yeah Amir?
AUDIENCE: 1, 2, 3, 4, pushing x plus plus?
BILL THIES: I think the plus plus is
executed after the push.
AUDIENCE: Push plus plus?
BILL THIES: So it starts at zero.
So I think a postfix expression executes after the
actual access.
Yeah.
Yeah.
Other questions?
OK so let's step up a level and look at
what we have in StreamIt.
So the first question is, how do you represent this
connectivity between different building blocks?
How do you represent streams?
And if you look at traditional programming models, kind of
the conventional wisdom is that a stream
program is a graph.
You have different nodes that are
communicating to each other.
And graphs are actually kind of hard to analyze.
They're hard to represent.
They're a little bit confusing.
So the approach we decided to take in StreamIt is one of a
structured computation graph.
So instead of having arbitrary interconnections between the
stages, we have a hierarchical description in
which every individual stage has a single input and a
single output.
And you can compose these together into
higher level stages.
Of course there are some stages that do split and join with
multiple inputs.
We'll get to that.
So the analog here is structured
control flow, in your favorite
imperative programming language.
Of course there was a day when everyone used goto statements
instead of having structure control flow.
We've got a fan of goto statements in the audience?
OK, I'll get to you later.
But the problem was, it's really hard to understand the
program that's jumping all over the place because there's
no local reasoning you can have. You know you're jumping
to this location, you're coming back a different way.
It's hard to reason about program components.
So when people went to structured control flow,
there's just if-else and for-loop statements.
Those are the two basic constructs.
You can basically express all kinds of computation in those
simple primitives.
And things got a lot simpler.
And you know people objected at one point even.
You know, what about a finite-state machine?
Don't you need goto statements for a finite-state machine,
going from one state to another?
And now everyone writes an FSM with a really simple idiom.
You usually have a while loop around a case statement.
Right, you have a dispatch loop.
So now whenever you see that pattern you can
recognize, oh, there's a finite-state machine.
It's not just a set of gotos.
It's a finite-state machine.
So we think there are similar idioms in
the streaming domain.
And that's kind of the direction we're pushing from a
design standpoint.
So what are our structures that we have?
Well here are our structured streams. At the
base we have a filter.
That's just the programmable unit like I showed you.
We have a pipeline, which just connects one stream to another
in a sequence.
So this gives you pipeline parallelism.
There's a splitjoin where you have explicit parallelism in
the stream.
So I'll talk about what these splitters and joiners can do.
It's basically a predefined pattern of scattering data to
some child streams, and then gathering that data back into
a single stream.
So the whole construct still remains single input and
single output.
Likewise a feedback loop is just a simple way to put a
loop in your stream.
And of course these are hierarchical.
So all of these green boxes can be any of the three
constructs.
So that's how you can have these hierarchical graphs.
And again, since everything is single-input single-output,
you can really mix and match.
You know choose your favorite components, and they'll always
fit together.
You don't need to stitch multiple connections.
So let's dive inside one of these filters now.
And I gave you a feel for how they look before, but here's a
little more detail.
So how do we program the filter?
Well a filter just transforms one stream
into another stream.
And here it's transforming a stream of floating-point
numbers into another floating-point number stream.
It can take some parameters at the top.
These are actually fixed at compile time in our model,
which allows the compiler to really specialize the filter's
code depending on the context in which it's being used.
So for example here, we're inputting N and a frequency.
And then we have two stages of execution.
At initialization time-- this runs once at the beginning of
the whole program-- we can calculate some weights, for
example, from the frequency.
And we can store those weights as a local variable.
So you can think of this kind of like a Java class.
You can have some member variables.
You can retain state from one execution to the next.
The work function is the closest
thing to the main function.
This is called repeatedly in the steady state.
And here are the IO rates of the work function.
This filter actually peeks at some data items. That means
that it inspects more items on the input channel than it
actually consumes on every iteration.
So we'll look at N input items, and we'll push one new
item onto the output and pop one item from the input tape
every time we execute.
So here we have a sliding window computation.
It means the next time through, we'll just slide this
window up by one and inspect the next N items
on the input tape.
And inside the work function you can have pretty much
general purpose code.
Right now we just disallow pointers and a few other
things to keep things simple.
But the idea is this is general purpose imperative
code inside the work function.
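The peek/pop/push discipline of a sliding-window filter can be sketched in Python (again, not StreamIt syntax, and the weights below are made up; a real filter would compute them from the frequency at init time). Each firing peeks at N items, pushes one weighted sum, and pops a single item, so the window slides by one.

```python
from collections import deque

def fir_filter(stream, weights):
    """Sliding-window filter: peek len(weights) items, push one
    weighted sum, pop one item per firing."""
    n = len(weights)
    window = deque(stream[:n], maxlen=n)   # the first N items on the tape
    out = []
    for nxt in stream[n:] + [None]:
        out.append(sum(w * v for w, v in zip(weights, window)))  # push 1
        if nxt is not None:
            window.append(nxt)             # pop 1; the peek window slides
    return out

print(fir_filter([1, 2, 3, 4], [0.5, 0.5]))  # → [1.5, 2.5, 3.5]
```

Note there is no modulo arithmetic or head/tail pointer in sight; the access pattern is explicit, which is the property the compiler exploits.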
Now what's nice about this representation of a filter is,
for one thing, this peek function.
So what we really have is a nice representation of the
data access pattern-- how you're reading the data
on the input channel.
And if you look at this for example in a language like C,
it's a lot messier.
So usually when you're doing buffer management, you have to
do some modulo operations.
You have to keep a circular buffer of your live data,
and increment, you know, a head or tail pointer and wrap
around the side with some kind of modulo operation.
And for a compiler, this is a real nightmare.
Because modulo operations are kind of the
worst thing to analyze.
You can't see what it's actually trying to read.
And if you want to map this buffer to a network or to a
combined communication with some other actor or filter in
the graph, it's pretty much impossible.
So here we're exposing all that to the compiler.
And you'll see how that can make a difference.
And also it's just a lot easier to program.
I mean, I don't like looking at the code.
So I'm going to go to the next slide.
OK, so how do we piece things together?
Let's just build some higher level components.
So here's a pipeline of two components.
And we already saw a pipeline.
You can just add one component after another.
And add just basically has the effect of making a queue, and
just queueing up all of the components that you added, and
connecting them one after another.
So here we have a BandPassFilter: by connecting a
LowPassFilter and feeding its output into a HighPassFilter,
you end up with a BandPassFilter.
OK what about a splitjoin?
How do we make those?
So a splitjoin has an add statement as well.
And here we're adding components in a loop.
So what this means is now when we say add, we're actually
adding from left to right.
So instead of going top to down, we're adding from left
to right across the splitjoin.
And we can actually do that in a loop.
So here we're inputting a parameter N. And depending on
that value, we'll add N
BandPassFilters to this splitjoin.
So it's kind of cool, right?
Because you can input a parameter, and that parameter
actually affects the structure of your graph.
So this graph is unrolled at compile time by our compiler,
constructing a big sequence of computations.
And it can resolve the structure and communication
pattern in that graph, and then map it to the underlying
substrate when we compile.
Also notice here the splitter and the joiner.
So we have a predefined set of splitter and joiners.
I'll go into more detail later.
But here we're just duplicating the data to every
one of these parallel components, and then doing a
round-robin join pattern where we bring them back together
into a single output stream.
And if you want to do an equalizer, you basically need
an adder at the bottom to add the
different components together.
And another thing you can notice here is that we have
some inlining going on.
So we actually embedded this splitjoin inside a higher
level pipeline.
So what this does is it prevents you from having to
name every component of your stream.
You can have a single stream definition with lots of nested
components.
And the natural extension is, you can basically scale up
to the natural size of a procedure, just like you
would in an imperative language.
And here is for example, an FM radio where we have a few
inline components.
And the interesting thing here is that there's a pretty good
correspondence between the lines of the text and the
actual structure of the graph.
And that's something that's hard to find in
an imperative language.
I mean if you want to for example, stitch nodes together
with edges, it's often very hard to visualize the
resulting structure of the graph that you have. But here
if you just go through the program, you can see that the
AtoD component goes right over to the AtoD, the demodulator
to the demodulator, and so on.
And even for the parallel components, you can kind of
piece them together.
So that's kind of how we think of building programs. Any
questions so far?
OK so this is kind of how you go about
programming in StreamIt.
But programming is kind of a plug-and-chug activity, right?
Nobody wants to be a code monkey.
The reason we're all here is to see what's beautiful about
this programming model.
Right?
Don Knuth, a famous computer scientist from
Stanford, said this during his Turing Award Lecture:
"Some programs are elegant, some are
exquisite, some are sparkling.
My claim is that it is possible to write grand
programs, noble programs, truly
magnificent ones!" Right.
We want the best programs possible.
It's not just about making it work.
We want really beautiful programs. So what's beautiful
about the streaming domain?
What can you go away with and say wow, that was a really
beautiful expression of the computation.
Well for me I think one of the interesting things here is the
splitjoin construct.
Splitjoins can really be beautiful.
You know some mornings I just wake up and I'm like, oh I'm
so glad I live in a world with splitjoins.
You know?
And now splitjoins will be part of your world.
You can say this tomorrow.
This is just wonderful.
So OK, what do we have in the splitjoin constructs?
You can duplicate data.
You can do a round-robin communication pattern from one
to another, or round-robin join.
Now the duplicate is pretty simple.
You just take the input stream and duplicate
it to all the children.
No problem.
What do you do for a round-robin?
Well you pass N items from the input to a given child.
So for example, if N is 1, we'll just distribute one at a
time to the child streams. And you get a pattern like this, a
round-robin just going across.
And you can do the same thing on the joiner side.
Let's say you're joining with a factor of one.
You're just reading from the children and putting them into
a single stream.
OK, so that's pretty colorful, but nothing too fancy yet.
Let's consider a different round-robin factor.
So a round-robin of 2 means that we peel off 2 items from
the input, and pass those items to the first output.
OK, there actually is going to be a quiz on this.
So ask questions if this doesn't make sense.
OK pass the next 2 items, and 2 items round-robin like that.
And you can actually have nonuniform
weights if you want to.
So on the right let's say we're doing
round-robin 1, 2, 3.
That means we pass 1 item from the first stream, 2 items from
the next stream, and then 3 items from the third stream.
OK, pretty simple.
We're just doing 1, 2, and 3, and so on.
Does that make sense?
I'm going to build on this so any questions?
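The weighted round-robin splitter can be sketched in Python (not StreamIt). With weights (1, 2, 3), the splitter peels 1 item for the first child, then 2 for the second, then 3 for the third, and repeats.

```python
def roundrobin_split(stream, weights):
    """Distribute items to len(weights) children: weights[i]
    consecutive items go to child i, cycling until the input
    runs out."""
    children = [[] for _ in weights]
    it = iter(stream)
    try:
        while True:
            for child, w in zip(children, weights):
                for _ in range(w):
                    child.append(next(it))
    except StopIteration:
        return children

print(roundrobin_split(list(range(6)), [1, 2, 3]))
# → [[0], [1, 2], [3, 4, 5]]
```

A round-robin joiner is just the inverse: read weights[i] items from child i at a time and concatenate them back into one stream.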
OK, this was colorful, but this isn't totally beautiful yet.
So what's beautiful about this?
well let's see how you might write a matrix transpose.
OK, something you guys have probably all written at one
point in your life is transposing a matrix.
Let's say this matrix has M rows, and it has N columns
going across.
And we're starting with a representation in which the
stream is basically, I think this is row-major order.
The first thing that you're doing is going across the rows
before you're going to the next row.
I'm sorry, you're going across the columns, and then down to
the next row.
So it's zigzagging like this.
And you want to pass through a transpose, so you
zigzag the other way.
You do the first column, up, and then the next
column, and so on.
And this comes up a lot in a stream program.
So it turns out you can represent this as a splitjoin.
Oh that's not good.
Just in my moment of glory as well.
OK, yeah slides.
You can download slides.
Actually I can do this on the board.
This is the thinking part anyway.
Yeah, could you just bring up GMail?
I have a backup on GMail.
Can we focus on the board?
So this is going to be a little exercise anyway.
OK, here's what we had.
We had M, M rows, N columns.
We started with an interleaving like this.
Right, and we want to go into a splitjoin.
And this will be a round-robin construct.
And what I want you to do is fill in the round-robin
weight, and also the number of the streams. And you can have
a round-robin at the bottom.
And when it comes out, you want the opposite
interleaving.
This is M and N.
OK so there are 3 unknowns here, what you're doing the
round-robin by-- can you guys see that over there--
how many parallel streams there are, and what you're
joining the round-robin by.
Ok so I'm going to give you a minute to think about this.
Try to think about this.
See if you can figure out what these constants are.
You just basically want to read from this data in a row
major pattern, M rows and N columns, and end up with
something that's column major.
What are the values for the constant?
Does it make sense?
Ask a question if it doesn't make sense.
Yeah?
AUDIENCE: So we assume the values are going, based on the line that you drew across it, just like a stream line?
BILL THIES: Right, right, right.
So those values are coming down the stream.
You have a 1-dimensional stream.
It's interleaved like this.
So you'll be reading them like this.
And then you output a 1-dimensional stream that is
threading the columns.
Does that make sense?
Somebody ask another question.
Yeah
AUDIENCE: So for actual matrix transpose code, it's my understanding that nobody actually does it sequentially like that because of locality issues.
Instead it's broken up into blocks.
BILL THIES: So there are ways to optimize this.
AUDIENCE: And after you've sort of serialized it, can you
then capture--
PROFESSOR: Guys, [UNINTELLIGIBLE PHRASE]
has a blocking segment.
So you can hierarchically do that.
So, normally what happens is you do the blocks and inside
the blocks, you can do it again.
You can do it at two levels, basically.
BILL THIES: Any hypotheses?
Anyone?
AUDIENCE: Is it N for first one, M for the second one?
BILL THIES: OK, what do we have
AUDIENCE: N for the first one, and M for the second?
BILL THIES: N,M and
AUDIENCE: Same for the M?
BILL THIES: Same, M?
AUDIENCE: Yeah.
BILL THIES: OK OK, this is a hypothesis.
Other hypotheses?
AUDIENCE: N, M, 1.
BILL THIES: OK, anyone else?
Any amendments?
AUDIENCE: How about 1, 1, 1?
BILL THIES: 1, 1, 1?
OK lottery is closing.
Yep?
AUDIENCE: 1, M, M
BILL THIES: 1 M, M?
1, M, M, OK and last call?
OK I believe two of the ones submitted are correct.
Is this one, yeah?
OK, I think N, M, 1 works, and 1, N, M works.
So let me explain the 1, N, M. This is how I like to think about it.
One way to think about this is we just want to
move the whole matrix.
You doing OK, yeah?
OK we just want to move the whole
matrix into this splitjoin.
So the way we can do that is have N parallel streams in the splitjoin, since we have N columns in the matrix.
And what we'll do is we'll just do a round-robin one at a
time, from the columns of the matrix into the
columns of the splitjoin.
So we'll take the first element, send it to the first stream, the next element to the next column, and so on.
So we get the whole matrix here. Now I want to read it out column-wise.
So we'll do a joiner of M. We'll read M from the left
stream that'll read all M items from the columns, send
it out, and then M items in the next column, and
then send it out.
Does that make sense?
How many people understood that?
All right,
So if you think about it, you can also do it as N, M, 1.
That's basically, yeah, it's very similar.
OK there we were.
And yes.
We can do 1, N, M. We basically read the matrix in, and then pull it out column by column.
And it's very easy to write this as a transpose.
So we just have a transpose filter in
which we're doing nothing.
No computation in the actual rows or the actual contents of the splitjoin.
And we just split the data by 1, have N different identity
filters, and then join it back together by M. Any questions
about this?
An interesting way to write a transpose.
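As a sanity check on the 1, N, M answer, here is a small Python sketch, not StreamIt, with a function name of my own, that does the transpose exactly this way: split by 1 into N column streams, identity filters, then join M at a time.

```python
def transpose_splitjoin(stream, M, N):
    """Transpose an M-row by N-column matrix arriving in row-major
    order, using the splitjoin recipe: split round-robin(1) into N
    streams, identity filters, join round-robin(M)."""
    columns = [[] for _ in range(N)]
    for i, x in enumerate(stream):
        columns[i % N].append(x)   # split by 1: item i goes to column i mod N
    out = []
    for col in columns:            # join by M: drain a whole column at a time
        out.extend(col)            # each column stream holds exactly M items
    return out
```

For a 2-by-3 matrix stored row-major as `[0, 1, 2, 3, 4, 5]`, this returns `[0, 3, 1, 4, 2, 5]`, the column-major order.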
OK so there's one more opportunity to shine here.
And that's a little more interesting permutation called
a bit-reversed ordering.
And so this comes up in an FFT and other algorithms.
The permutation here, is that we're taking the data
at the index n.
And let's say n has binary digits b sub 0, b sub
1, up to b sub k.
And we want to rearrange that data, so this data goes to a
different index with the reversed bits of its index.
So if it was at index n before, it ends up at b sub k
down to b sub 1, b sub 0.
So for example, let's just look at
3-digit binary numbers.
If we have 0, 0, 0, this is 1 input item.
We're reversing those digits, you still get 0, 0, 0.
The item at index 1 will be 0, 0, 1.
We want to reorder that to index 4, which is 1, 0, 0.
0, 1, 0 stays the same, and 0, 1, 1 goes to 1, 1, 0, it shifts over.
And from there on, it's symmetric.
AUDIENCE: [UNINTELLIGIBLE PHRASE]
BILL THIES: Sorry?
AUDIENCE: [UNINTELLIGIBLE PHRASE].
BILL THIES: OK.
So here I'm writing the indices.
So I'm not writing the data.
So index 0, 1, 2, 3, up through index 7.
OK and on the bottom you have indices 0 through 7.
So the data will actually be moved, reordered like that.
Does that make sense?
It's a reordering of data.
Does this transformation make sense?
Other questions?
OK, it turns out you can write this as a splitjoin.
And you just need 3 different weights for the round-robins.
OK round-robin with 1 weight on the top here, and two
different round-robin weights on the bottom.
And here I'm assuming you have 3 binary digits in your index.
So you're reordering in groups to 8.
OK, so let me give you a second to think about this.
What are the values for these weights?
I'll give you a hint in just a second,
or ask another question.
Yes?
AUDIENCE: [UNINTELLIGIBLE PHRASE].
BILL THIES: So what we're doing here, is we're exposing
the communication pattern.
That's the thing.
If you write this in an imperative way, you end up
basically having your, you're conflating basically the data
dependencies with the reordering pattern.
So what I'm trying to convey here, is how you can use the
streaming model to show how you're sending data around.
Because when you're on an architecture-like cell,
everything is about the data motion.
You're taking data from one place and you're trying to
efficiently get it to the producers or the consumers.
And you really need to-- the compiler needs to understand
the data motion.
And also, it's just another way of writing it, which I
think it's actually easier to understand once you see it.
It's a way to think about the actual reordering from a
theoretical standpoint.
Any wagers on this?
So what we to do, if you think about the bit-reverse
ordering, what we want to do is distribute the data by the
low-order bits.
And then gather the data by the high-order bits.
So you want a fine-grained parity when you're shuffling.
You can also do it the other way around.
It's totally symmetrical.
But one way to think about it is, you want a fine-grained
parity when you're distributing data, and then a
coarse-grained when you're coming together.
Anyone see it?
Give you ten more seconds.
Come on?
Yeah it is a little bit tricky.
OK well let me explain how it works.
So 1, 2, and 4.
Ok so what these round-robin 1 splitters do, is these are
basically the fine-grained parities.
So OK, the first round-robin, that will send all the even values to the left, and all the odd values to the right.
Right, that's the lowest order bit.
Because it's doing every other one,
shuffling it left and right.
So this round-robin is seeing only the even values.
Now it's going to split them up based on which are divisible by 4.
They'll go to the left or go to the right.
This is basically shuffling in the order of the bits, from
low-order bits to high-order bits.
So these will be ordered in terms of their low-order bits.
And now we just want to read them out from left to right.
Just take the order that we made with those round-robins,
and read them out from left to right.
And since they have 8 values, you can do that just by
chunking them up.
We'll read 2 and 2, and then we'll read these two, 2 and 2, and now we've output all 8.
Does that make sense?
OK, so yes, it's a bit clever.
So I think it's a nice way of thinking about what a
bit-reversed ordering means.
And you can write this in a very general way.
You just have a recursive bit-reversed
filter for N values.
And base case, you only have 2.
So there's no reordering to do when you when
you think about it.
So you're not doing any computation.
Otherwise you have a round-robin split in half, and
then have a coarse-grain joiner.
So, you get a structure like this.
As you're building up, you're just distributing and then bringing back together in a coarse-grained way.
OK.
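The recursive splitjoin just described can be simulated in Python as well; this is a sketch in my own names, not the StreamIt source: split by the low-order bit with round-robin(1, 1), recurse on each half, then join the two halves coarsely.

```python
def bit_reverse(stream):
    """Bit-reversed reordering via a recursive splitjoin:
    round-robin(1, 1) split on the low-order bit, recurse on each
    half, then a coarse-grained join (all of the left half, then
    all of the right).  Length must be a power of two."""
    items = list(stream)
    if len(items) <= 2:
        return items                   # base case: nothing to reorder
    evens = bit_reverse(items[0::2])   # low-order bit 0
    odds = bit_reverse(items[1::2])    # low-order bit 1
    return evens + odds                # coarse join: n/2 and n/2
```

`bit_reverse(range(8))` yields `[0, 4, 2, 6, 1, 5, 3, 7]`, so the item at index 1 (binary 001) lands at index 4 (binary 100), matching the 3-digit example above.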
Let's see, how do I do this?
OK so one thing to notice, there's one more
example of a splitjoin.
Question?
AUDIENCE: [UNINTELLIGIBLE].
BILL THIES: OK so in general, at the base of this hierarchy,
we could've added some other filter to do some computation.
Identity just means we're doing no computation.
It's a predefined filter that just does nothing.
PROFESSOR: On complex data.
BILL THIES: On complex data.
Sorry this is a templated filter.
So we're reordering complex values.
And we're passing the input to the output.
Amir?
AUDIENCE: [UNINTELLIGIBLE]
BILL THIES: In general the language does not have support
for templates.
We only do it for these base classes.
That's more of an implementation detail.
Yeah.
AUDIENCE: [UNINTELLIGIBLE].
BILL THIES: Right now there isn't, but
nothing fundamental there.
Yeah?
Other questions?
Yeah?
AUDIENCE: How did you know that there are two filters
after that?
BILL THIES: Two filters, sorry here?
AUDIENCE: [UNINTELLIGIBLE].
BILL THIES: OK.
OK.
So we have two add statements between the
split and the join.
So that branches to two parallel streams.
Is that your question?
AUDIENCE: Yeah.
How do you know that, though, that there's not like three?
BILL THIES: So the compiler will analyze this at compile time.
And it'll know these values of N and propagate them down at
compile time.
So it'll basically symbolically evaluate this
code, see there are two branches, and you can unroll
this communication pattern.
AUDIENCE: I think another way of thinking about it is, each
add statement essentially adds another
branch in your splitjoin.
PROFESSOR: That is another box.
BILL THIES: It's about to get clear actually.
Other questions?
AUDIENCE: [UNINTELLIGIBLE]
BILL THIES: That's one way to think about it, yeah.
Wait, say again, counting?
AUDIENCE: A radix sort.
BILL THIES: A radix sort, right.
OK, well is it sorting?
It's not really sorting.
AUDIENCE: Well it's not really sorting.
BILL THIES: It's a permutation.
AUDIENCE: Could you do a radix sort?
BILL THIES: You could do a radix sort.
So actually, what I want to show next is how you can morph
this program into a merge sort by changing only a few lines.
OK look carefully.
Don't blink.
OK there's merge sort.
So very similar pattern.
This is one of those idioms. It's a recursive idiom with
splitjoins.
But now in the base case, we have a sort.
So we would basically branch down.
What we ended up with was a sort in the base case.
We're just sorting a few values.
And we call [? merge sort ?] twice again, and
then we do a merge.
So instead of identity at the base case here, we now have a
basic sorting routine.
And we merge those results from both sides.
And the only thing I changed in terms of the communication
rate, is to be more efficient we just distributed data in
chunks instead of doing a fine-grained splitting.
Actually you do it however you want in a merge sort.
But this is chunked up.
And let's just zoom in here.
This is how a merge sort looks in StreamIt.
So we split the data two ways, both directions, come
together, do a sort on both sides, and then you merge.
And so by having the, you know you can interleave pipelines
and splitjoins like this.
So you have these hierarchical structures that are coming
back together.
Does this make sense?
OK.
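The merge sort shape just shown can likewise be sketched in Python; this is not the StreamIt code, and the names are mine: split the data in chunks, sort each branch, and merge the results at the joiner.

```python
def merge(left, right):
    """Joiner-side merge of two already-sorted streams."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

def merge_sort(stream):
    """Recursive splitjoin: chunked split into two halves
    (round-robin n/2, n/2), sort each branch, merge at the join."""
    items = list(stream)
    if len(items) <= 2:
        return sorted(items)       # base case: a tiny sort filter
    half = len(items) // 2         # coarse-grained split point
    return merge(merge_sort(items[:half]), merge_sort(items[half:]))
```

Note how only the base case (a sort instead of identity) and the split weights changed relative to the bit-reversal structure; the recursive splitjoin idiom is the same.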
I'm going to hold off on messaging actually.
Let me see, what do I want to cover?
OK let me actually skip to the end.
Oh, I can show you this.
Yeah, I'm going to cut short a little bit.
So here's how other programs look written in StreamIt.
OK, you can have a Bitonic sort.
OK so you see a lot of these regular structures.
And the compiler can unroll this and then
match it to the substrate.
This is how an FFT looks.
It's quite an elegant implementation of an FFT.
It'd be good to go into in more detail.
You can do things like block matrix multiply.
You don't always have to have column or row-wise ordering.
It's natural to split things up like this.
We have a lot of DSP algorithms, the filter bank,
FM radio with equalizer, radar array front end.
Here's an MP3 decoder.
And let's see, I'm going to skip this section and just
give you a taste for the end here.
I'm skipping a hundred slides.
Yeah.
OK, so to give you a feel.
Our biggest program written in StreamIt so far, is the
complete MPEG-2 encoder and decoder.
So here is MPEG-2 decoder.
And I think you've seen block diagrams of this
already in the class.
And so it's a pretty natural expression.
You can really get a feel for the high-level structure of
the algorithm mapping down.
And for example, here on the top we're doing the spatial
decoding looking inside each frame.
Whereas at the bottom we're doing the temporal decoding
between two frames, the motion compensation.
And one thing that I didn't have a chance to mention, is
that we have a concept of teleport messaging.
What this means is, I showed you how the steady state flow
data goes between these actors in the stream.
But sometimes you want to control the stream as well.
For example, this is a variable length
decoder at the top.
It's parsing the input data.
It might want to change how the processing is happening
downstream.
For example, say that you have-- you know in this case
you have different picture types coming in.
And you want to tell other components to change their
processing based on a non-local effect.
And that's hard to do if you just want static data rates.
But what we have is this limited notion of dynamism, where you're basically poking into somebody else's stream.
And we let you do that very precisely.
I don't have time to go into the details, but you can
basically synchronize the arrival of these messages with
the data that's also flowing through the stream. And so in
this case, were sending through the picture type.
And it really simplifies the program code.
I didn't have time for details, but we put them in the slides anyway, if you're interested.
And if you do a similar communication pattern in C,
it's a little bit of a nightmare.
You have all these different, basically memory spaces,
different files.
And the control information is basically going left
and right all over.
So this really helps both the compiler and the programmer as
well in StreamIt.
So it's all implemented.
It's about 2,000 lines of code in StreamIt.
Which is about 2/3 of the size of the C code, taking into
account similar functionality there.
And it's a pretty big program.
It's written as 48 static streams. And then we expand that to more than 600 instantiated filters.
So this gives you a lot of flexibility when you're trying
to get parallelism.
Question?
AUDIENCE: When a compiler downloads all bytes?
BILL THIES: Oh the object code, you mean.
OK, so right now in our current implementation, we duplicate a
lot of code.
So it ends up being bigger than it needs to be.
There's no reason for us to do that.
That's kind of a-- we have a research compiler that makes that easy.
AUDIENCE: So object-wise, it's not data.
BILL THIES: Object-wise we still need to do that
comparison.
Yeah that's a good question.
Yeah.
Other questions?
OK so let me cut to the end.
OK, so we have the StreamIt language.
And we think it really
preserves the program structure.
It's a new way of thinking about how you orchestrate the
data reordering with the splitjoins, showing you who is
communicating to who, and how you can stitch together
different pieces in your program development.
And again, really our goal is to get this scalable multicore
performance.
But you can't get people on board just on
a performance stat.
You need to show them a new programming model that
actually makes their lives easier.
So that's what we're working on.
And thanks for listening.
[APPLAUSE]
BILL THIES: Any last questions?
Yes?
AUDIENCE: So in the decoder, you have a lot of computation stages that were not sequential streams.
Like for example, the output of the discrete cosine transform is not a stream of pixels, you are going to have coefficients and things like that.
Which are a logical sort of chunk.
BILL THIES: Yes, so depending on the granularity of the computation, you don't need to pass individual values over the stream.
For example, you can have a stream that inputs the whole
array at a time.
And so we basically advocate that if you have something
that coarse-grained, you should be passing an array or
a macroblock, in the case of MPEG, or a set of coefficients
in a structure.
So when you have coarse-grained parallelism, you write your program in a coarse-grained way.
The fine-grained things I showed for the bit interleaving and so on are more for the fine-grained programs.
AUDIENCE: Can you do both?
In the sense that can you stream over an array, so it's
stream of stream, so to speak.
BILL THIES: So there's an interesting multidimensional
problem there.
Right now we've taken a 1-dimensional approach.
So far, basically, the programmer has to set an
iteration order, and end up with a 1-dimensional stream
coming into and out of every filter.
We're working on extending that.
Yeah, but when you have basically streams of
2-dimensional data, you like the freedom to either iterate
basically in time or in space, depending
on what you're doing.
And so I think that's more of a research problem.
So far we've just been doing a 1-dimensional representation.
Yeah good point.
Other questions?
Yeah?
AUDIENCE: Why did you decide on this synchronous dataflow
model as opposed to something more general?
BILL THIES: So our philosophy has been that you want to
start with the most kind of basic block of a stream
program, and optimize it really well.
And then you can stitch those together into
higher level blocks.
So we think of synchronous dataflow as being kind of the
basic block of streaming.
You know what's coming in, you know what's coming out.
And even if you have a more general model, there'll be pieces that fit under the synchronous dataflow model.
And so we saw a lot of optimization opportunities in there.
And really knowing those IO rates can let you do a lot of
things that you can't do in a general model.
So I wanted to get the simple case right first. And actually
kind of our focus now is on expanding, and how do you look
at the heterogeneous system, and how do you optimize a more
dynamic system.
Yep.
Other questions?
OK.
Yes.
You can check out our web page.
Yeah if you Google for StreamIt, I'm sure
you can find it.
Yeah, we have a public release.
Yes, send us any problems. It's actually a good test.
We want to make sure it works for everyone.
But I mean, we've had, you know, hundreds of downloads.
There are a lot of people using StreamIt.
It shouldn't break if you download it.
Yeah.
OK good.
Thanks.