So it seems that some people prefer dinner to a presentation,
but I hope everything goes well here.
I'm going to start with the Pig Latin part,
which was prepared by me and Professor Judy Qiu.
First we'll go through a simple introduction to the Pig Latin system.
Pig Latin is a framework used to analyze large-scale
unstructured and semi-structured data, and it runs on top of a Hadoop cluster.
It consists of two components. The first is the Pig Engine.
The second is the Pig Latin program interface.
The Pig Engine is the compiler and the runtime.
It parses the Pig Latin statements, compiles them
into MapReduce jobs, and runs them on top of a Hadoop cluster.
The Pig Latin language is the high-level language for the Pig Engine system.
It is an explicitly declarative, SQL-like language,
and as you will see later, writing a Pig Latin application
is very simple. Here is a chart:
the left column of this chart shows Pig Latin statements,
and the right column
shows the logical execution plan of those Pig Latin statements.
These statements will be [unknown], optimized, and run on a Hadoop cluster.
[pause]
The main motivation for using Pig Latin technology
is pretty simple: it is used to accelerate development.
As you will see later, writing a Pig Latin application is very simple.
For some applications, it is as simple as writing a set of
SQL queries, and for complex cases, developers can integrate
user-defined functions into the Pig Latin statements.
In addition, you can also write Python scripts
and embed Pig Latin statements inside them.
These are some of the advanced features of Pig.
And here is an example that shows this motivation for using Pig Latin.
The example is to find the top 5 most frequent words.
You will see later that the program to find the top 5 words
takes just ten lines in total with Pig Latin statements,
while the Java MapReduce version takes 200 lines,
and it would take only 15 minutes in Pig Latin
while the Java MapReduce version would take about 4 hours.
So the goal of using Pig Latin technology is to
accelerate programming, and this is one of our main purposes
in introducing this kind of technology to the students.
And here is WordCount using MapReduce.
You can see that to implement the WordCount application,
you have to implement a map function and a reduce function,
the basic idea of which Professor Chandra
already introduced in the previous session.
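
The slide itself is not reproduced in this transcript. As a rough stand-in, here is a minimal Python sketch of the map and reduce functions for WordCount. It is not the Java code from the slide, and it only imitates on one machine what Hadoop does across a cluster; the function names are made up for illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())
```

In a real Hadoop job the developer writes only the map and reduce parts, and the framework takes care of the grouping step in between.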
And here is WordCount using Pig Latin statements.
You can see that it takes just 4 lines of code
to implement the WordCount application,
and you can use additional Pig Latin statements
to sort the results and pick
the top 5 most frequent words.
All of this is done within seven lines of code,
so this is very concise and easy.
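
The Pig Latin script from the slide is not part of this transcript. To give a feel for the dataflow it describes, here is a small Python sketch of the same steps: split lines into words, group and count, order by count, and keep the top 5. The comments name the Pig operators that usually play each role; that mapping is an assumption about the slide, not a copy of it.

```python
from collections import Counter


def top_words(lines, n=5):
    """Return the n most frequent words as (word, count) pairs."""
    # Roughly TOKENIZE plus FLATTEN: turn each line into individual words.
    words = [word.lower() for line in lines for word in line.split()]
    # Roughly GROUP by word followed by COUNT on each group.
    counts = Counter(words)
    # Roughly ORDER by count descending, then LIMIT n.
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]
```

Each comment corresponds to about one statement in a Pig Latin script, which is why the whole program stays so short.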
[pause]
So we have shown that the programmability of the Pig Latin system
is very good. How about performance?
There's a project named PigMix, which
consists of a set of Pig Latin scripts used to
compare performance against Java MapReduce
implementations, and it also
includes some scripts that are used to evaluate
scalability and latency.
And you will see that in the earlier versions of Pig,
the performance of the Pig Latin system is not as good
as Java MapReduce, while in the latest version,
Pig Latin is even a little faster than Java
MapReduce. The reason is the Pig Engine:
the compiler compiles and optimizes the
Pig Latin statements into an optimized job execution flow.
For example, the Pig Engine can parse the Pig Latin
statements and remove some data redundancy
before it does a join operation between two tables.
This can decrease the amount of computation
in a specific task, especially a join task. So this is
one reason Pig Latin can be even faster
than Java MapReduce: it can optimize
the calculation across the Pig Latin statements.
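
The talk does not say exactly which optimization Pig applies here. Purely as an illustration of why removing redundant rows before a join saves work, here is a Python sketch. The tables, the `hash_join` and `distinct` helpers, and the "work" counter are all hypothetical, and dropping duplicates like this is only valid when they do not matter to the final result.

```python
def hash_join(left, right):
    """Inner join of (key, value) rows on key.

    Returns the joined rows and the number of row pairings produced.
    """
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    rows, work = [], 0
    for key, value in left:
        for other in index.get(key, []):
            work += 1
            rows.append((key, value, other))
    return rows, work


def distinct(rows):
    """Drop duplicate rows while keeping first-seen order."""
    return list(dict.fromkeys(rows))


# Hypothetical tables that contain repeated rows.
left = [("u1", "click")] * 3 + [("u2", "view")]
right = [("u1", "US")] * 2 + [("u2", "DE")]

naive_rows, naive_work = hash_join(left, right)
lean_rows, lean_work = hash_join(distinct(left), distinct(right))
```

With these sample tables the join over the raw rows produces 7 pairings, while the join over the de-duplicated rows produces 2, and both give the same set of distinct result rows.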
Here are other Pig highlights in addition to programmability
and performance. The first one is the UDF.
A UDF can be written to take advantage of the combiner.
If the UDF satisfies the associative and commutative rules,
this kind of computation can be done in parallel on the Hadoop cluster.
Pig also provides four types of join implementations.
And the third highlight is that writing load and store functions
is easy once an InputFormat and OutputFormat exist.
This is similar to Hadoop and HBase.
There are some built-in InputFormats and OutputFormats,
so you can deal with this kind of input conveniently
without repeating the work. And there are some other features.
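
To see why the associative and commutative rules matter for the combiner, here is a minimal Python sketch. It is not Pig's UDF API; the `combine` and `aggregate_in_partitions` names are hypothetical, and the partitions stand in for data blocks handled by different nodes.

```python
from functools import reduce


def combine(a, b):
    """An associative and commutative operation: adding partial counts."""
    return a + b


def aggregate_in_partitions(values, num_partitions):
    """Aggregate each partition separately, then merge the partial results."""
    # Split the values into partitions, as if they lived on different nodes.
    partitions = [values[i::num_partitions] for i in range(num_partitions)]
    # Combiner stage: each partition is reduced independently.
    partials = [reduce(combine, part, 0) for part in partitions]
    # Final stage: merge the partial results into one answer.
    return reduce(combine, partials, 0)
```

Because addition gives the same answer however the values are grouped or ordered, the result does not depend on the number of partitions. An operation such as subtraction would not have this property, so it could not be pushed into a combiner this way.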
Then we come to the background: who uses Pig Latin, and for what.
There are several high-level languages, such as Pig and Hive.
The difference between them is that Pig, plus Hadoop,
prefers to process data stored on disk, while Hive
and HBase prefer to process data stored in memory.
As you know, disk capacity is much larger
than memory [unknown]. Pig is usually used by
commercial companies to process large-scale [unknown].
About 70% of production jobs at Yahoo!
are processed by Pig. There are also
some other commercial companies, like Twitter, LinkedIn,
eBay, and AOL, that use Pig to process applications,
including processing web logs, building user behavior models,
processing images, building maps of the web,
calculating page rank,
and doing research on large datasets.
Okay, here we come to the Pig hands-on.
We will first introduce how to access Pig, and then we will present
two samples, WordCount and K-means clustering.
In the first example we will go through basic Pig knowledge,
including data types, operations, and how to run Pig scripts.
In the second example we will go through some
advanced features of Pig Latin, including the
Pig embedding API and user-defined functions.
So how do you access Pig? There are a couple of ways
to access the Pig framework. The first is batch mode.
You can submit Pig Latin statements as a batch job.
It's similar to qsub in the PBS system.
The other approach is the interactive one, where you submit
the Pig Latin statements line by line and get results immediately.
Pig also provides some third-party programming language
interfaces, so users can write programs that describe
more complex business flows with the Pig API.
It also provides two execution modes. One is local mode.
It basically runs the Pig Latin statements on a standalone machine,
and this is very convenient for debugging.
The second is MapReduce mode.
It compiles the Pig Latin statements
and runs them on a Hadoop cluster. This is a very practical mode.
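
On the command line, the execution mode is normally picked with the `-x` option of the `pig` launcher (`local` or `mapreduce`). Here is a small Python sketch that only builds such a command line and does not run it; the `pig_command` helper and the script name `wordcount.pig` are hypothetical.

```python
def pig_command(mode, script=None):
    """Build a pig command line for the given execution mode.

    mode: "local" for a standalone machine, "mapreduce" for a Hadoop cluster.
    script: path to a Pig Latin script for batch mode, or None for interactive use.
    """
    if mode not in ("local", "mapreduce"):
        raise ValueError("mode must be 'local' or 'mapreduce'")
    command = ["pig", "-x", mode]
    if script is not None:
        # Batch mode: the whole script is submitted as one job.
        command.append(script)
    return command


# Hypothetical script name, used only for illustration.
batch_local = pig_command("local", "wordcount.pig")
interactive_cluster = pig_command("mapreduce")
```

When no script is given, the launcher opens the interactive Grunt shell, which is where statements are entered line by line. The resulting list could be handed to `subprocess.run` on a machine where Pig is installed.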