So this is how that data would get stored in the real physical storage.
[pause]
So I mentioned a concept called
column families earlier, but I didn't go into details.
So when you look at the actual storage of the data,
it's stored in terms of column families.
So whatever data comes under one column family
would be stored together, in a set of regions
which are sequential.
[pause]
So we have a column family for the BasicInfo,
and then whatever keys and values come under
the BasicInfo would be stored in this notation on the disk.
And then when you store this data,
the data would be stored in a sorted order.
So the first sorting would happen based on the row key,
and then based on the column qualifier,
in this case the Name,
and then the third would be based on the Time Stamp.
When sorting based on the Time Stamp it would be
descending, so whatever the latest one is
would appear at the top. So the advantage of this
storage format and this sorting is the access patterns it enables.
So for example, if you want to do a get operation where you
want to get, let's say, the email addresses of all the people,
or you want to get whatever information is in the
BasicInfo family for all the people or a set of people,
then all this data would be stored consecutively.
And then if you want to take only a partial range of this data,
the data is already sorted, so it's very easy to query.
And then the access would be much faster because
you'd be accessing data which is sequential in the physical storage.
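To make that concrete, here is a minimal sketch of such a
range scan using the HBase Java client (the 2.x API is assumed;
the 'people' table name, the row-key bounds, and the 'Email'
qualifier are illustrative assumptions, not from the slides):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("people"))) {
          // Scan only the BasicInfo column family, over a contiguous
          // slice of the (sorted) row-key space.
          Scan scan = new Scan()
              .withStartRow(Bytes.toBytes("aaa"))   // inclusive
              .withStopRow(Bytes.toBytes("bbb"))    // exclusive
              .addFamily(Bytes.toBytes("BasicInfo"));
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
              byte[] email = row.getValue(
                  Bytes.toBytes("BasicInfo"), Bytes.toBytes("Email"));
              System.out.println(Bytes.toString(row.getRow()) + " -> "
                  + (email == null ? "(no email)" : Bytes.toString(email)));
            }
          }
        }
      }
    }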
So another concept which HBase has:
this physical storage can grow up to gigabytes or terabytes
in terms of the size. So in that case, HBase divides this
physical table into things called regions.
So for example, there can be a region for row keys up to 'bbb'
and then another region from there downwards and then
several more regions. And then these regions
would be stored in a distributed manner across the cluster.
[pause]
So let's have a look at the HBase architecture.
So you have the client, and then you have the HBase master nodes,
which take care of assigning regions and
of recovery whenever regions stop running, and then you have
the region servers, each of which takes care of a set of regions.
So if you look at this region server, it would be
responsible for managing and storing region 'A', for example.
And then this region is actually stored across multiple nodes,
for performance as well as fault-tolerance and durability perspectives,
and those nodes would be responsible for that region.
But in case the traffic gets high for this region,
and there's going to be a lot of activity in this region,
then HBase will notice that this is becoming
a bottleneck, and then it will automatically partition this region
into two regions and store them on two nodes across the cluster,
which will improve the performance automatically.
[pause]
So if you look at use cases, Facebook uses HBase
to store information about the URLs you share,
Twitter uses HBase to store messages,
and Mozilla uses it to store crash reports.
And if you go to the HBase website, there's a huge list of
companies and use cases, on a page called PoweredBy,
which shows the use cases of HBase
and how those companies are using it.
[pause]
So, let's move into programming with HBase.
So HBase has different APIs. HBase has the HBase shell,
where you can log into the shell and manipulate the tables:
create tables, delete tables, put data, get data.
And then it has a Java API, as well as other APIs
like a Thrift API and a REST API.
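As a rough sketch of the Java API side (assuming the HBase 2.x
client; the 'people' table, the 'BasicInfo' family from earlier,
and the sample row are illustrative), creating a table, putting
data, and getting it back could look like this:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class JavaApiExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
          TableName name = TableName.valueOf("people");
          // Create a table with a single column family, BasicInfo.
          admin.createTable(TableDescriptorBuilder.newBuilder(name)
              .setColumnFamily(ColumnFamilyDescriptorBuilder.of("BasicInfo"))
              .build());
          try (Table table = connection.getTable(name)) {
            // Put one row, then read the same cell back with a Get.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("BasicInfo"),
                Bytes.toBytes("Name"), Bytes.toBytes("Alice"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(result.getValue(
                Bytes.toBytes("BasicInfo"), Bytes.toBytes("Name"))));
          }
        }
      }
    }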
So the other thing is HBase has a MapReduce API
that makes it very easy to do MapReduce computations
on top of the data that is stored in HBase.
And then you can write these MapReduce computations
even using Hadoop-based languages like Pig or Hive.
So when you're using Hadoop MapReduce with HBase,
HBase provides several utility classes for you to use.
So for example the TableInputFormat is a Hadoop InputFormat
which takes care of feeding the data stored in HBase
as key/value records to map tasks, as well as
splitting the data to create the map tasks.
So in this case, if you're using TableInputFormat,
whatever data is stored in a single region
would be loaded into a single map task. And then it also provides
mapper and reducer base classes to use, and some
utility classes which will help in configuring your
MapReduce computation, and then several writable types.
So the writable types are the datatypes
in Hadoop MapReduce computations,
and HBase provides a set of writable types
which you can use to interact with data stored in HBase.
So this is an example of a MapReduce computation
that we're going to do on top of data stored in HBase.
So this is a driver program where we configure a job
and then submit it. Since we are running out of time,
I'm not really going to go into details,
but you can get your configuration using a method HBase provides,
and then you can define a scan where you will define
the input data for your MapReduce computation.
And then you can use the utility methods like
initTableMapperJob to specify which table to use,
which mapper class to use, and what the input
and output formats are.
And you can do the same thing using
initTableReducerJob to configure the
reducer part of your computation.
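A minimal driver sketch along those lines (the 'people' and
'name_counts' table names and the NameCountMapper/NameCountReducer
classes are hypothetical; TableMapReduceUtil and its init methods
are the standard HBase utilities mentioned above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class NameCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // HBase-aware configuration
        Job job = Job.getInstance(conf, "name-count");
        job.setJarByClass(NameCountDriver.class);

        // The scan defines the input data: here, the BasicInfo family.
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("BasicInfo"));

        // Input side: read the 'people' table through NameCountMapper.
        TableMapReduceUtil.initTableMapperJob(
            "people", scan, NameCountMapper.class,
            Text.class, IntWritable.class, job);

        // Output side: write Puts into the 'name_counts' table.
        TableMapReduceUtil.initTableReducerJob(
            "name_counts", NameCountReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }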
And then when you're writing a mapper with HBase,
you can use this TableMapper base class to
implement your mapper. And since
in the earlier slide we used the TableInputFormat,
when you use that, you automatically get the data
in HBase as these two HBase built-in datatypes:
your key would be an ImmutableBytesWritable,
and then the value would be a Result.
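Continuing the hypothetical example from the driver above, a
mapper sketch could look like this (the BasicInfo:Name column
comes from the earlier slides; everything else is assumed):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    // TableMapper fixes the input types: the key is the row key as an
    // ImmutableBytesWritable, and the value is the whole row as a Result.
    public class NameCountMapper extends TableMapper<Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
          throws IOException, InterruptedException {
        byte[] name = row.getValue(Bytes.toBytes("BasicInfo"), Bytes.toBytes("Name"));
        if (name != null) {
          // Emit (name, 1) so the reducer can count occurrences per name.
          context.write(new Text(name), ONE);
        }
      }
    }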
[pause]
So in the reducer, you will get the normal Hadoop
intermediate data with the normal datatypes,
whatever you output from the mapper.
And then the important part is that the HBase API supports
storing this data back into HBase tables using
these datatypes: when we configured
the MapReduce computation using those earlier utility methods,
it automatically set the output format to the TableOutputFormat.
So whatever data you output here using this Put datatype
would be stored in the table you configured in your driver program.
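A matching reducer sketch (again hypothetical; the 'Counts:total'
column in the name_counts table is made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    // TableReducer fixes the output value type to a mutation (here a Put),
    // which the TableOutputFormat writes into the configured HBase table.
    public class NameCountReducer
        extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

      @Override
      protected void reduce(Text name, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
          total += count.get();
        }
        // One Put per name: row key = name, Counts:total = summed count.
        Put put = new Put(Bytes.toBytes(name.toString()));
        put.addColumn(Bytes.toBytes("Counts"), Bytes.toBytes("total"),
            Bytes.toBytes(total));
        context.write(new ImmutableBytesWritable(put.getRow()), put);
      }
    }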