So this is how that data would get stored in the real physical storage.
[pause]
So I mentioned a concept called
column families earlier, but I didn't go into details.
So when you look at the actual storage of the data,
it's stored in terms of column families.
So whatever data comes under one column family
would be stored together, in a set of regions
which are sequential.
[pause]
So we have a column family for the BasicInfo,
and then whatever keys and values come under
the BasicInfo would be stored in this notation on the disk.
And then when you store this data,
the data would be stored in a sorted order.
So the first sorting would happen based on the row key,
and then based on the column qualifier,
in this case the Name,
and then the third would be based on the Time Stamp.
When sorting based on the Time Stamp it would be
descending, so whatever the latest one is
would appear at the top. So the advantage of this
storage format and this sorting is the access patterns it enables.
So for example, if you want to do a get operation where you
want to get, let's say, the email addresses of all the people,
or you want to get whatever information is in the
BasicInfo family for all the people or a set of people,
then all this data would be stored consecutively.
And then if you want to take only a partial range of this data,
the data is already sorted, so it's very easy to query.
And then the access would be much faster because
you'd be accessing data which is sequential in the physical storage.
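To make that concrete, here is a minimal sketch of such a
range scan using the HBase Java client (the 2.x API is assumed;
the 'people' table name, the row-key bounds, and the 'Email'
qualifier are illustrative assumptions, not from the slides):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("people"))) {
          // Scan only the BasicInfo column family, over a contiguous
          // slice of the (sorted) row-key space.
          Scan scan = new Scan()
              .withStartRow(Bytes.toBytes("aaa"))   // inclusive
              .withStopRow(Bytes.toBytes("bbb"))    // exclusive
              .addFamily(Bytes.toBytes("BasicInfo"));
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
              byte[] email = row.getValue(
                  Bytes.toBytes("BasicInfo"), Bytes.toBytes("Email"));
              System.out.println(Bytes.toString(row.getRow()) + " -> "
                  + (email == null ? "(no email)" : Bytes.toString(email)));
            }
          }
        }
      }
    }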
So another concept which HBase has:
this physical storage can grow up to gigabytes or terabytes
in terms of the size. So in that case, HBase divides this
physical table into things called regions.
So for example, there can be a region for row keys up to 'bbb'
and then another region from there downwards and then
several more regions. And then these regions
would be stored in a distributed manner across the cluster.
[pause]
So let's have a look at the HBase architecture.
So you have the client, and then you have the HBase master nodes,
which take care of assigning regions and
of recovery whenever regions stop running, and then you have
the region servers, each of which takes care of a set of regions.
So if you look at this region server, it would be
responsible for managing and storing region 'A', for example.
And then this region is actually stored across multiple nodes,
for performance as well as fault-tolerance and durability perspectives,
and those nodes would be responsible for that region.
But in case the traffic gets high for this region,
and there's going to be a lot of activity in this region,
then HBase will notice that this is becoming
a bottleneck, and then it will automatically partition this region
into two regions and store them on two nodes across the cluster,
which will improve the performance automatically.
[pause]
So if you look at use cases, Facebook uses HBase
to store information about the URLs you share,
Twitter uses HBase to store messages,
and Mozilla uses it to store crash reports.
And if you go to the HBase website, there's a huge list of
companies and use cases, on a page called PoweredBy,
which shows the use cases of HBase
and how those companies are using it.
[pause]
So, let's move into programming with HBase.
So HBase has different APIs. HBase has the HBase shell,
where you can log into the shell and manipulate the tables:
create tables, delete tables, put data, get data.
And then it has a Java API, as well as other APIs
like a Thrift API and a REST API.
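As a rough sketch of the Java API side (assuming the HBase 2.x
client; the 'people' table, the 'BasicInfo' family from earlier,
and the sample row are illustrative), creating a table, putting
data, and getting it back could look like this:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class JavaApiExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
          TableName name = TableName.valueOf("people");
          // Create a table with a single column family, BasicInfo.
          admin.createTable(TableDescriptorBuilder.newBuilder(name)
              .setColumnFamily(ColumnFamilyDescriptorBuilder.of("BasicInfo"))
              .build());
          try (Table table = connection.getTable(name)) {
            // Put one row, then read the same cell back with a Get.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("BasicInfo"),
                Bytes.toBytes("Name"), Bytes.toBytes("Alice"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(result.getValue(
                Bytes.toBytes("BasicInfo"), Bytes.toBytes("Name"))));
          }
        }
      }
    }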
So the other thing is HBase has a MapReduce API
that makes it very easy to do MapReduce computations
on top of the data that is stored in HBase.
And then you can write these MapReduce computations
even using Hadoop-based languages like Pig or Hive.
So when you're using Hadoop MapReduce with HBase,
HBase provides several utility classes for you to use.
So for example the TableInputFormat is a Hadoop InputFormat
which takes care of feeding the data stored in HBase
as key/value records to map tasks, as well as
splitting the data to create the map tasks.
So in this case, if you're using TableInputFormat,
whatever data is stored in a single region
would be loaded into a single map task. And then it also provides
mapper and reducer base classes to use, and some
utility classes which will help in configuring your
MapReduce computation, and then several writable types.
So the writable types are the datatypes
in Hadoop MapReduce computations,
and HBase provides a set of writable types
which you can use to interact with data stored in HBase.
So this is an example of a MapReduce computation
that we're going to do on top of data stored in HBase.
So this is a driver program where we configure a job
and then submit it. Since we are running out of time,
I'm not really going to go into details,
but you can get your configuration using a method HBase provides,
and then you can define a scan where you will define
the input data for your MapReduce computation.
And then you can use the utility methods like
initTableMapperJob to specify which table to use,
which mapper class to use, and what the input
and output formats are.
And you can do the same thing using
initTableReducerJob to configure the
reducer part of your computation.
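A minimal driver sketch along those lines (the 'people' and
'name_counts' table names and the NameCountMapper/NameCountReducer
classes are hypothetical; TableMapReduceUtil and its init methods
are the standard HBase utilities mentioned above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class NameCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // HBase-aware configuration
        Job job = Job.getInstance(conf, "name-count");
        job.setJarByClass(NameCountDriver.class);

        // The scan defines the input data: here, the BasicInfo family.
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("BasicInfo"));

        // Input side: read the 'people' table through NameCountMapper.
        TableMapReduceUtil.initTableMapperJob(
            "people", scan, NameCountMapper.class,
            Text.class, IntWritable.class, job);

        // Output side: write Puts into the 'name_counts' table.
        TableMapReduceUtil.initTableReducerJob(
            "name_counts", NameCountReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }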
And then when you're writing a mapper with HBase,
you can use this TableMapper base class to
implement your mapper. And since
in the earlier slide we used the TableInputFormat,
when you use that, you automatically get the data
in HBase as these two HBase built-in datatypes:
your key would be an ImmutableBytesWritable,
and then the value would be a Result.
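Continuing the hypothetical example from the driver above, a
mapper sketch could look like this (the BasicInfo:Name column
comes from the earlier slides; everything else is assumed):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    // TableMapper fixes the input types: the key is the row key as an
    // ImmutableBytesWritable, and the value is the whole row as a Result.
    public class NameCountMapper extends TableMapper<Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
          throws IOException, InterruptedException {
        byte[] name = row.getValue(Bytes.toBytes("BasicInfo"), Bytes.toBytes("Name"));
        if (name != null) {
          // Emit (name, 1) so the reducer can count occurrences per name.
          context.write(new Text(name), ONE);
        }
      }
    }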
[pause]
So in the reducer, you will get the normal Hadoop
intermediate data with the normal datatypes,
whatever you output from the mapper.
And then the important part is that the HBase API supports
storing this data back into HBase tables using
these datatypes: when we configured
the MapReduce computation using those earlier utility methods,
it automatically set the output format to the TableOutputFormat.
So whatever data you output here using this Put datatype
would be stored in the table you configured in your driver program.
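A matching reducer sketch (again hypothetical; the 'Counts:total'
column in the name_counts table is made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    // TableReducer fixes the output value type to a mutation (here a Put),
    // which the TableOutputFormat writes into the configured HBase table.
    public class NameCountReducer
        extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

      @Override
      protected void reduce(Text name, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
          total += count.get();
        }
        // One Put per name: row key = name, Counts:total = summed count.
        Put put = new Put(Bytes.toBytes(name.toString()));
        put.addColumn(Bytes.toBytes("Counts"), Bytes.toBytes("total"),
            Bytes.toBytes(total));
        context.write(new ImmutableBytesWritable(put.getRow()), put);
      }
    }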