So now let's get to the hands-on assignment.
So I mentioned that Hadoop is this
toolkit for doing lots of great things.
So you don't actually have to worry about all the intricacies
of scheduling, fault tolerance, and things like that.
The only things you have to be aware of are your data,
your application, and how to actually map them
to the programming model of key/value pairs.
[pause]
So here is just the classic WordCount example,
and here I have my map function and my reduce function.
As I mentioned, when we work with WordCount,
the keys are actually the offsets into the file, so we
really care only about the words in the values. And the job client,
as I mentioned in the previous slides, is actually what runs the job.
There are also configuration parameters here that you can tweak.
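To make the key/value mapping concrete, here is a minimal sketch of WordCount in plain Python, with no Hadoop involved. It is not the code shown on the slide. The function names (read_records, map_fn, reduce_fn, run_wordcount) are made up for illustration, and the in-memory grouping only stands in for the shuffle that Hadoop performs between map and reduce.

```python
from collections import defaultdict


def read_records(text):
    """Yield (byte_offset, line) pairs, so each line is keyed by its offset in the file."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\r\n")
        offset += len(line.encode("utf-8"))


def map_fn(offset, line):
    """Map step: ignore the offset key and emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: sum all the counts collected for one word."""
    yield word, sum(counts)


def run_wordcount(text):
    """Run map, group values by key (a stand-in for the shuffle), then reduce."""
    groups = defaultdict(list)
    for offset, line in read_records(text):
        for word, one in map_fn(offset, line):
            groups[word].append(one)
    result = {}
    for word in sorted(groups):
        for key, total in reduce_fn(word, groups[word]):
            result[key] = total
    return result


if __name__ == "__main__":
    print(run_wordcount("the quick fox\nthe lazy dog\n"))
```

The map function receives the offset only because that is what the input hands it; all the counting is driven by the words in the value.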
[pause]
Stephen introduced you guys to his framework,
an environment for actually running Hadoop,
so at this point, you guys should already be
good to go with programming. So I'll assume that.
Also, I think it's important to note that Hadoop is written in Java,
but it also comes in flavors for Python and C++,
through this thing called Hadoop Streaming and the C++ Pipes API.
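As a rough illustration of the Python flavor, Hadoop Streaming runs the mapper and reducer as ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output. The sketch below assumes that convention; the function names mapper and reducer are made up here, and the two steps are written as functions over iterables of lines so they can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the framework: map, sort by key, then reduce.
def simulate(lines):
    return list(reducer(sorted(mapper(lines))))
```

In a real streaming job each function would be wired to sys.stdin in its own script, and the framework, not sorted(), would deliver the mapper output to the reducer grouped by key.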
So the difference is that Stephen's framework is batch mode.
Here I want to show you guys an interactive way
of programming. With the interactive way
you guys can get comfortable with HDFS commands
and actually see how you would start the name nodes
and the data nodes and all that good stuff. So I think I'll...
[pause]
So in this part, I'm just setting up the virtual machine, right?
So here I log in as 'root', with the password 'school2012'.
And all of this, of course, is uploaded to the site.
So I prepared this WordCount [unknown] to ease you in:
the source code and the input file, which is just a document.
And I also have this build script. What this build script
does, essentially, is compile your Java program
and copy the jar file to the Hadoop bin... so that
you don't have to worry about your class path and all that stuff.
So here I'm just extracting the WordCount.
[pause]
So now, I kinda got ahead of myself,
but I'm compiling it now using the build script.
[pause]
So now that I've compiled that, I'm going to the bin directory of Hadoop.
So now I actually want to format the name node.
[pause]
As I mentioned, when you format the
name node, you only need to do that once.
[pause]
So now that that's formatted, I'll just start my
HDFS daemons and my MapReduce daemons.
[pause]
(new speaker) So the... oh, sorry.
(Jerome Mitchell) Oh, that's fine.
(new speaker) So the format on the name node,
what does that do?
(Jerome Mitchell) So essentially it just wipes the HDFS,
and it just formats it, that's all it does.
[pause]
So when you start the HDFS daemons and the MapReduce
daemons, it's a good idea to check that they're running.
So here I have the name node, the secondary name node,
the data node, the task tracker, and the job tracker all started.
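The steps so far (format once, start the daemons, check them) can be sketched as follows. The transcript does not show the exact commands typed, so this assumes a Hadoop 1.x layout, where `hadoop namenode -format`, `start-dfs.sh`, and `start-mapred.sh` live in the bin directory, and it assumes the check is done with the JDK's `jps` tool, which lists running Java processes as "pid name". The helper names startup_commands and missing_daemons are made up for illustration.

```python
# Process names as jps reports them for the five daemons named in the talk.
EXPECTED_DAEMONS = {
    "NameNode",
    "SecondaryNameNode",
    "DataNode",
    "JobTracker",
    "TaskTracker",
}


def startup_commands(hadoop_bin="bin", format_namenode=False):
    """Return the commands as argv lists; formatting is only needed the first time."""
    commands = []
    if format_namenode:
        commands.append([f"{hadoop_bin}/hadoop", "namenode", "-format"])
    commands.append([f"{hadoop_bin}/start-dfs.sh"])      # HDFS daemons
    commands.append([f"{hadoop_bin}/start-mapred.sh"])   # MapReduce daemons
    return commands


def missing_daemons(jps_output):
    """Given the text printed by jps, return the expected daemons that are not listed."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            running.add(parts[1])
    return EXPECTED_DAEMONS - running
```

Each argv list could be handed to subprocess.run(cmd, check=True); an empty set from missing_daemons means all five daemons are up.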
So now I want to create a directory in HDFS.
[pause]
So I created a folder on the distributed file system.
[pause]
So now I actually want to copy the data
in the Hadoop WordCount input onto HDFS.
[pause]
So now I'm actually gonna execute the Hadoop job.
[pause]
The WordCount example.
[pause]
So here the arguments are hadoop, jar, the jar
of the WordCount, the class, and the input and the output.
One of the things about the output: you don't need to
create an output folder on HDFS, it's created automatically.
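The three HDFS steps just described (make a directory, copy the input in, run the jar) can be sketched as one small helper. It assumes the standard `hadoop fs -mkdir`, `hadoop fs -put`, and `hadoop jar` subcommands. The helper name job_commands and every path and class name in the usage below are placeholders, not the ones used in the demo.

```python
def job_commands(local_input, hdfs_input, jar, main_class, hdfs_output,
                 hadoop="bin/hadoop"):
    """Return the argv lists for staging the input on HDFS and launching the job."""
    return [
        [hadoop, "fs", "-mkdir", hdfs_input],             # create the input directory
        [hadoop, "fs", "-put", local_input, hdfs_input],  # copy the local file into it
        # No mkdir for the output: the job creates that directory itself.
        [hadoop, "jar", jar, main_class, hdfs_input, hdfs_output],
    ]


# Placeholder values, for illustration only.
commands = job_commands("input.txt", "input", "wordcount.jar", "WordCount", "output")
```

Each list could be passed to subprocess.run(cmd, check=True) in order. Because the job creates the output directory itself, a rerun typically needs a fresh output path or the old directory removed first.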
[pause]
So here it's showing that it's done with the map phase,
and now it's doing the reduce operation.