IU Cloud Computing MOOC - Pig Latin 1 - Pig Made Easy

So it seems that some people prefer dinner to a presentation. I hope everything goes well here, and I'm going to start with the Pig Latin part, which was prepared by me and Professor Judy Qiu. First we'll go through a simple introduction to the Pig system. Pig is a framework used to analyze large-scale unstructured and semi-structured data, and it runs on top of a Hadoop cluster. It consists of two components: the first is the Pig Engine, and the second is the Pig Latin programming interface. The Pig Engine is the compiler and the runtime; it parses and compiles Pig Latin statements into MapReduce jobs and runs them on top of the Hadoop cluster. The Pig Latin language is the high-level language for the Pig Engine. It is an explicitly declarative, SQL-like language, and you will see later that writing a Pig Latin application is very simple. In this chart, the left column shows Pig Latin statements, and the right column shows the logical execution plan of the Pig Latin statements. These statements will [unknown], be optimized, and run on the Hadoop cluster. [pause] The main motivation for using Pig Latin is pretty simple: it accelerates development. For some applications, writing a Pig Latin program is as simple as writing a set of SQL queries, and for complex cases, developers can integrate user-defined functions into Pig Latin statements. In addition, you can write Python scripts and embed Pig Latin statements inside them; this is one of the advanced features of Pig. Here is an example to show this motivation: finding the top 5 words with the highest frequency.
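The slide's own script is not reproduced in this transcript. As a rough illustration of the task only, here is a minimal sketch in plain Python (not Pig Latin) of the same dataflow: tokenize, group and count, then order and limit. The function name `top_words` and the sample input are assumptions made for this sketch.

```python
from collections import Counter


def top_words(lines, n=5):
    """Return the n most frequent words as (word, count) pairs."""
    # Tokenize: split every line on whitespace
    # (roughly the role of TOKENIZE and FLATTEN in Pig Latin).
    words = (word for line in lines for word in line.split())
    # Group by word and count occurrences
    # (roughly the role of GROUP ... BY and COUNT).
    counts = Counter(words)
    # Order by count, descending, and keep the first n
    # (roughly the role of ORDER ... DESC and LIMIT).
    return counts.most_common(n)


# Hypothetical sample input, just to exercise the function.
sample = ["pig runs on hadoop", "pig latin is simple", "hadoop pig"]
print(top_words(sample, n=2))  # [('pig', 3), ('hadoop', 2)]
```

Each step here runs in one process on an in-memory list; the point of Pig is that the same three steps are compiled into MapReduce jobs over data stored on a cluster.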
You will see later that the program to find the top 5 words takes only ten lines in total in Pig Latin, while the Java MapReduce version takes 200 lines; and it would take only 15 minutes to write in Pig Latin, while the Java MapReduce version would take about 4 hours. So the goal of using Pig Latin is to accelerate programming, and this is one of our main purposes in introducing this kind of technology to the students. Here is WordCount using MapReduce. You can see that to implement the WordCount application, you have to implement a map function and a reduce function; Professor Chandra already introduced the basic idea in the previous session. And here is WordCount using Pig Latin statements. You can see that it takes just 4 lines of code to implement WordCount, and you can use additional Pig Latin statements to order the result and pick the top 5 words by frequency. All of this is done within seven lines of code, so it is very concise and easy. [pause] So we have shown that the programmability of the Pig Latin system is very good; how about performance? There is a project named PigMix, which consists of a set of Pig Latin scripts used to compare performance against Java MapReduce implementations, and these scripts evaluate scalability and latency as well. You will see that in earlier versions of Pig, the performance was not as good as Java MapReduce, while in the latest version, Pig Latin is even a little faster than Java MapReduce. The reason is the Pig Engine: the compiler optimizes the Pig Latin statements and turns them into an optimized job execution flow. For example, the Pig Engine can parse the Pig Latin statements and remove some data redundancy before it does a join operation between two tables.
This can decrease the amount of computation in a specific task, especially a join task. So this is one reason Pig Latin can be even faster than Java MapReduce: it can explore the computation across the Pig Latin statements. Here are other Pig highlights in addition to programmability and performance. The first is the UDF, which can be written to take advantage of the combiner: if the UDF satisfies the associative and commutative rules, this kind of computation can be done in parallel on the Hadoop cluster. Second, Pig provides four types of join implementation. The third highlight is that writing load and store functions is easy once an InputFormat and OutputFormat exist. This is similar to Hadoop and HBase: there are some built-in InputFormats and OutputFormats, so you can deal with these kinds of input conveniently without repeating the work. And there are some other features. Then we come to the background: who uses Pig Latin, and for what? There are other high-level languages, like Pig and Hive. The difference between them is that Pig, with Hadoop, prefers to process data stored on disk, while Hive and HBase prefer to process data stored in memory. As you know, the disk's capacity is much larger than the memory [unknown]. Pig is usually used by commercial companies to process large-scale [unknown]. About 70% of production jobs at Yahoo! are processed by Pig. Other commercial companies, like Twitter, LinkedIn, eBay, and AOL, also use Pig for applications including processing web logs, building user behavior models, processing images, building mappers of the map, calculating PageRank, and doing research on large datasets. Okay, here we come to the Pig hands-on. We will first introduce how to access Pig, and then we will go through two samples: WordCount and K-means clustering.
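As an aside on the combiner remark above, here is a minimal Python sketch (not Pig code, and not taken from the lecture slides) of why the associative and commutative rules matter. The function names and the sample partitions are assumptions made for illustration. A sum can be combined per partition and then merged; a mean cannot be merged directly, but it can be once each partition carries a (sum, count) pair.

```python
def combine_sum(partitions):
    """Sum each partition separately, then sum the partial results."""
    partials = [sum(part) for part in partitions]  # combiner step, one per partition
    return sum(partials)                           # final reduce step


def naive_mean_of_means(partitions):
    """Incorrect in general: averaging the per-partition averages."""
    means = [sum(part) / len(part) for part in partitions]
    return sum(means) / len(means)


def combine_mean(partitions):
    """Carry (sum, count) pairs, which can be merged in any order or grouping."""
    partials = [(sum(part), len(part)) for part in partitions]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical data split into three unequal partitions.
data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[0:3], data[3:5], data[5:8]]

print(combine_sum(partitions))          # 31, same as sum(data)
print(combine_mean(partitions))         # 3.875, same as sum(data) / len(data)
print(naive_mean_of_means(partitions))  # differs from 3.875 for these partitions
```

Because the partial results merge correctly regardless of how the data is split, each partition can be processed on a different node before the final merge.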
In the first example we will go through basic Pig knowledge, including data types, operations, and how to run Pig scripts. In the second example we will go through some advanced features of Pig Latin, including the Pig embedding API and user-defined functions. So how do you access Pig? There are a couple of ways to access the Pig framework. The first is batch mode: you can submit Pig Latin statements as a batch job, similar to qsub in the PBS system. The other approach is interactive mode, where you submit Pig Latin statements line by line and get the results immediately. Pig also provides interfaces for third-party programming languages, so that users can write programs that describe more complex business flows with the Pig API. It also provides two execution modes. One is local mode, which basically runs Pig Latin statements on a standalone machine; this is very convenient for debugging. The second is MapReduce mode, which compiles the Pig Latin statements and runs them on a Hadoop cluster. This is a very practical mode.