So now that we feel like we've got at least a decent
understanding of MapReduce from a high level, and some of you
probably had been dealing with MapReduce
before even this summer school session started,
let's actually start to talk a little bit about a cluster.
So what exactly is a cluster?
Some of you may have had clusters in your cereal this morning.
Others of you may have been actually experiencing clusters
from a computing perspective. So if you go from this kind of
local notion of what a cluster is, starting with my laptop:
now if my problems are starting to get big enough that
I actually can't compute them within my own little laptop environment,
there's this whole concept of high-performance computing,
high-throughput computing, distributed computing.
We're starting to get this mode of, "I think I'm probably going to need
more than one computer to do my work. I'm gonna need to put my data
across multiple machines, I'm going to need to use multiple CPUs."
And so it starts to become a lot more complex,
now machines need to communicate with each other.
I need to be able to actually
have my data accessible from every single computer.
So it becomes a harder problem. And you say,
"Wait a sec, I'm not even a computer scientist.
All I want to do is solve my Bioinformatics problem, all I want to do
is solve my Linguistics problem. I just have a lot of data to deal with.
I want to go out and pull all this Twitter data
and I'm going to figure out something interesting out of it."
Well, how do I do that on a cluster? So again, these are complicated
problems which are maybe... dealing with a particular domain.
So let's dig a little bit into what exactly is a cluster.
So from a high level, a cluster is just a set of computers that are
located in somewhat close proximity. They have networking that's
making them act as kind of a single unified thing; this is where
this kind of term of 'supercomputer' can sometimes come into play.
There's typically storage that either can be distributed
across the cluster, or at least, in a distributed manner,
can be accessible from the other computers.
And you probably will have a resource manager tied to that.
What this means is that this resource manager is
actually keeping track of my other computers:
what are they doing, how much CPU usage do they have,
how much memory are they using, what does the disk
I/O footprint look like on those computers? Because I may actually move work around.
And you can think of this as almost like a traffic cop:
if I've got a computer that's busy, maybe I'm gonna
shuffle work over to a different one. So from a high level,
whether we're in kind of a high-performance computing space
or in this MapReduce space, some of these high-level
properties of a cluster now play into this Hadoop framework
that we were just talking about in the previous presentation.
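The "traffic cop" role of the resource manager can be sketched in a few lines of Python. This is a toy illustration, not any real scheduler's API; the node names and the `cpu_load` metric are hypothetical.

```python
# Toy "traffic cop" resource manager: route the next task to the
# least-busy node. Node names and load values are made up.

def pick_node(nodes):
    """Return the node with the lowest CPU load to receive the next task."""
    return min(nodes, key=lambda n: n["cpu_load"])

cluster = [
    {"name": "node1", "cpu_load": 0.90},  # busy
    {"name": "node2", "cpu_load": 0.15},  # mostly idle
    {"name": "node3", "cpu_load": 0.55},
]

print(pick_node(cluster)["name"])  # node2
```

A real resource manager (HTCondor, SLURM, YARN) weighs memory, disk I/O, and queue policy as well, but the core idea is the same.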
I'm again solving some sort of an application problem,
and I don't necessarily wanna have to know all of these details,
I just want my application to work.
That's the beauty of sometimes having underlying frameworks
that can take advantage of some of these things for you
and actually putting the data in the right spot
and running on the right computer and storing it in a proper place.
So here's a very simple way to look at this.
I've got this computer, which is sometimes called a head node.
And I've got these other computers that are sometimes
called the execute nodes. And we'll see slightly different
terminology in this Hadoop space for these MapReduce clusters.
But a head node is typically... if anybody's ever logged into a
high performance computer, these are the computers
that you're going to log into, and you're kind of interacting on.
What you can do then is you can schedule work out
to these execute nodes, and they'll actually handle your task.
So what you'll also see here is that you've got
a file system that can distribute data across those nodes,
so that I can say, "Well, what if I want my application
to be accessible across every single one of the nodes?"
You may have it stored in a spot where it's
distributed across the entire environment.
So if we take this kind of view of a cluster, and say,
"Well, what can I actually do with this?",
let's take a high level look at a very simple paradigm
of breaking down large problems and running them on a cluster.
So there's a... kind of a programming paradigm out there called Scatter/Gather.
And you take some input data. So let's take, as an example,
a file that I want to get the WordCount on,
back to this WordCount example:
I've got a million rows of data in this text.
I could have a very long... it could be a very long paper,
it could be... the Bible, it could be anything, very long set of text.
What if I want to do a WordCount across that? So I'll split that
across every single node that I have in my cluster,
maybe, in an even fashion, and now I can start to do WordCounts
across every single one of those nodes in this environment.
So this kind of diamond shape is a very nice model of it:
I've taken this input data, I've split it across all of my nodes,
and then I've gathered my output, whatever that is,
whatever operation I'm doing, back to a final landing spot.
And so if we take a step back and look at our cluster,
I can do this same mentality with this. All of a sudden now
I've split all of my data across all of my nodes in a cluster,
I'm operating on each one of them independently,
and now I'm going to gather back my result, and whatever that result was.
So you start to see this in this example now.
I'm taking this Scatter/Gather and I'm actually now mapping that
into how it would work on a cluster. So I've got a file system then,
and the data gets split and scattered across all of the nodes.
I execute on each one of those things independently
and then I gather my data back into an output.
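The Scatter/Gather WordCount just described can be sketched in Python. This is a single-machine simulation where each list chunk stands in for a node; it is not a real distributed job.

```python
from collections import Counter

def scatter(lines, n_nodes):
    """Split the input lines roughly evenly across n_nodes 'nodes'."""
    return [lines[i::n_nodes] for i in range(n_nodes)]

def count_words(chunk):
    """The work each node does independently: a local WordCount."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def gather(partials):
    """Merge the per-node counts back into one final result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

text = ["the quick brown fox", "the lazy dog", "the fox"]
partials = [count_words(chunk) for chunk in scatter(text, 2)]
print(gather(partials)["the"])  # 3
```

The key property is that each `count_words` call touches only its own chunk, so the per-node work is fully independent until the final gather.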
So from this... view here, what if I have a lot of data in this mentality?
So all of a sudden this is where... actually, this is why
Google changed their mentality of a cluster and started to
introduce this MapReduce framework. Because for them
what ended up happening in their problems,
because they were so large, was that the network actually
started to become a bottleneck in this framework.
So instead of trying to map data from a distributed file system
that's easily accessible from all of the nodes up to an execute node,
performing work on it, and then sending it back down to an output file,
they were saying that it's gonna be too costly
and our network and network technologies aren't keeping up
with the amount of data that we're dealing with.
And I think you can go out and look on Wikipedia
or something like that, but Google is now stating
publicly that they process about 24 petabytes a day.
So you start to think about how much data that actually is.
So what becomes your bottleneck in this?
Probably gonna be your network. So in this case,
if I'm gonna take this setup of what I see in a kind of a traditional cluster,
how can I start to change that mentality up?
I still need to be doing large computing,
how do I change my mentality, how do I change my cluster setup
so that I can actually operate on the data?
And so what they actually did then was they said,
"Well, why don't we make a file system that actually
puts the data essentially on the nodes?"
So look at the difference in the file systems: we see this says
'file system' across each one of these execute nodes.
That's a different setup than having a networking connection
where the file system's actually accessible from all of the nodes,
what they've done now is they've said, "Well, why don't we just
go ahead and put the data on the nodes,
and we'll operate on each one of those independently.
We're already doing that anyway in this kind of scatter-gather mentality.
Why don't we go ahead and just rethink how
we want to do our file systems and how we want to actually
build our clusters, and let's put the data actually out there."
So now I'm going to actually distribute my data across all of my nodes.
I'll probably do some sort of replication so that way then
if I have a node failure, again it still sits on another node,
so that's how you're getting this prevention of data loss.
And then I'll have the file system actually keep track of this
metadata for me so that I know which node's actually storing it.
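As a sketch of that idea, here is a toy version of block placement with replication plus the metadata map that tracks it. The block and node names are hypothetical, and a real system like HDFS uses rack-aware placement rather than simple round-robin.

```python
import itertools

def place_blocks(blocks, nodes, replication=2):
    """Assign each block to `replication` nodes, round-robin, and
    return the metadata map: block -> list of nodes holding a copy."""
    ring = itertools.cycle(nodes)
    metadata = {}
    for block in blocks:
        metadata[block] = [next(ring) for _ in range(replication)]
    return metadata

meta = place_blocks(["blk1", "blk2", "blk3"], ["nodeA", "nodeB", "nodeC"])
print(meta["blk1"])  # ['nodeA', 'nodeB']
```

Because each block lives on two nodes, losing any single node still leaves a copy of every block, which is the data-loss prevention just described.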
So now I can actually say, "Send my compute actually to those
nodes where the data resides." And now I've kind of
changed my mentality a little bit of I still have a head node,
and I still have these execute nodes, and I still have this resource manager
that's saying, "Hey, where should I be scheduling my work?"
Well, schedule your work to where the data resides and then it'll be simpler.
Now I can actually minimize my network traffic because it says,
"Well, your 'hello, world' is already over there on that node,
let's go ahead and go do a WordCount on it."
So the resource scheduler now changes its job:
not only does it look at which nodes are the most memory-intensive,
I/O bound, or CPU bound, but also where my data resides,
and it actually schedules tasks to that node
so that I can reduce my network traffic.
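That locality-aware scheduling decision can be sketched like this. Again the names are hypothetical, and a real scheduler such as YARN juggles many more constraints.

```python
def schedule(block, metadata, load):
    """Prefer the least-loaded node among those already holding `block`;
    fall back to the least-loaded node overall if no node has it."""
    holders = metadata.get(block, [])
    candidates = holders if holders else list(load)
    return min(candidates, key=lambda n: load[n])

metadata = {"blk1": ["nodeA", "nodeB"]}            # which nodes hold which data
load = {"nodeA": 0.8, "nodeB": 0.2, "nodeC": 0.0}  # current CPU load per node

print(schedule("blk1", metadata, load))  # nodeB: data-local and less busy
```

Note that nodeC is the idlest machine, but it does not hold the data, so shipping the task there would mean shipping the data too; locality wins over raw idleness.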
So again if we go back and look at this, what we've done now is,
back to this kind of splitting of data, we've actually done this split step
and put it out into the file system as the first thing we're gonna do.
Back to this example, I've split my data, and I went ahead and just
put it on the file system before I even started operating on it.
And that's what you start to see here. I've scattered
my data across all of these nodes, and now I'm just gonna
come in with a map task. I've already put the data out there;
I haven't had the nodes load it from the file system, because that was done before.
So if you picture Google indexing the web,
they're grabbing all this data from web pages, and they're starting to
put the data across all their nodes in their data centers,
and now say, "What do I want to do on that data?
What's my next step that I want to do in this map task?"
And so that gets us then into some of our next steps.
So... in our kind of checklist of things that we're doing here,
we've obviously just kind of talked about clusters.
We've gone a little bit from HPC clusters into
what exactly is a MapReduce cluster, with this kind of
little bit different setup. So I'll take you through a couple of
different topics then, specifically virtualization and the Cloud.