Pete: So tell me about the next layer of your architecture. You're doing some interesting things on the data side, using NoSQL technologies among other things. Talk to me a little bit about that.
Dag: For every single bid request that comes in, we have to do some sort of storage look-up. Actually, we usually do two look-ups per request. So if we're doing 25,000 requests per second, we have to do about 50,000 look-ups per second.
We also store information about the requests, so we have fairly substantial write throughput as well.
This is one of those obvious situations where traditional, very general database technologies fall short. We could probably scale the reads pretty well up to that volume, but the writes would be very, very hard. And because we're so latency constrained, having a general database engine add two or three extra milliseconds is just a waste of time.
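To make the numbers concrete, here is a rough sketch of what such a bid path could look like. This is an assumption-laden illustration, not their code: plain dicts stand in for the networked key-value store, and the latency budget and look-up keys are invented.

```python
import time

# Hypothetical stand-ins for the bidding-path key-value store; the real
# system would hit a networked KV cluster, not in-process Python dicts.
user_profiles = {"u1": {"segments": ["sports"]}}
campaign_counters = {"c1": {"impressions": 0}}

LATENCY_BUDGET_MS = 2.0  # illustrative storage budget per bid request

def handle_bid_request(user_id, campaign_id):
    start = time.perf_counter()

    # Two look-ups per bid request: at 25,000 requests per second that
    # comes to roughly 50,000 look-ups per second against the store.
    profile = user_profiles.get(user_id)
    counters = campaign_counters.get(campaign_id)

    # The same path also writes (e.g. bumping counters), which is why
    # write throughput matters as much as read throughput.
    if counters is not None:
        counters["impressions"] += 1

    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # A general-purpose database adding 2-3 ms here would blow the budget.
    return profile is not None and elapsed_ms <= LATENCY_BUDGET_MS

print(handle_bid_request("u1", "c1"))
```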
Pete: So because of this, how many different types of data stores do you have?
Dag: We have three types, basically. We have the key-value store, which is used for reads and writes during the time-critical bidding process. Then we have a traditional RDBMS, in our case MySQL, that we use for the more business-oriented, standard things that are nice to put in a database.
So we have our ledgers and transaction logs for money flow; not for the high-frequency stuff, but for the more low-frequency stuff. And then we also have a big-data-type store, which we use for doing analytics and queries on the really, really big data sets.
We actually have a fourth as well: raw logs, unstructured, or actually structured, that we can run ad-hoc MapReduce queries over if everything else fails.
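A loose sketch of how events might be routed across those four store types. The names and the mapping here are illustrative guesses, not their actual code:

```python
# Hypothetical dispatch across the four store types described above.
STORES = {
    "kv":        "key-value store (time-critical bid-path reads/writes)",
    "rdbms":     "MySQL (ledgers and transaction logs, low-frequency)",
    "analytics": "big-data store (queries over very large data sets)",
    "raw_logs":  "raw logs (ad-hoc MapReduce fallback)",
}

def route(event_type):
    # Sketch of the described split; unknown events fall back to raw logs.
    return {
        "bid_lookup":   "kv",
        "money_ledger": "rdbms",
        "impression":   "analytics",
    }.get(event_type, "raw_logs")

print(route("money_ledger"), "->", STORES[route("money_ledger")])
```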
Pete: And I think we spoke previously; you're looking at some different solutions for NoSQL data stores. Talk to me a little about the options there.
Dag: Yeah, there are a lot of options for NoSQL right now. First of all, I like the new explanation for NoSQL, which is “not only SQL”. This is all about finding the right tool for the job, and the different NoSQL solutions are usually very specialized.
There are some fairly general ones, like Mongo and CouchDB, and they have their sweet spots, but in our case we were looking for something very, very simple.
We actually just have keys and values; that's all we need. Something with extremely low latency, something that performs consistently and predictably. We looked at a number of different projects, including open source projects, and we actually started out running on open source projects. I'm not going to bash them right now, but we ended up going with a commercial vendor. Which I'm not going to sell either right now.
Well, yeah, they're called Citrusleaf. It's a really good company, actually; they deserve a lot of credit. They took a lot of the jitter in our performance right out of the equation. It's a very, very simple yet very capable NoSQL solution that performs insanely well. This is the stuff that we're running on SSD drives in a fairly small cluster, but it can still take immense load; the throughput is awesome.
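The jitter point is worth unpacking: on a bidding path it's the tail latency, not the average, that hurts. As a rough illustration with simulated numbers (not their measurements), two stores can share a median yet differ wildly at the 99th percentile:

```python
import random
import statistics

def percentile(samples, p):
    # Simple nearest-rank percentile over a list of samples.
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p / 100.0 * len(ordered)))
    return ordered[idx]

# Simulated per-look-up latencies in milliseconds for two hypothetical stores.
steady = [random.gauss(0.8, 0.05) for _ in range(10_000)]
jittery = [random.gauss(0.8, 0.05) if random.random() > 0.01
           else random.uniform(5, 50)  # occasional GC-style pause
           for _ in range(10_000)]

for name, data in (("steady", steady), ("jittery", jittery)):
    print(name,
          "median=%.2f ms" % statistics.median(data),
          "p99=%.2f ms" % percentile(data, 99))
```

Both stores look identical on the median; only the p99 reveals the jitter that would blow a per-request latency budget.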
Pete: So you mentioned that there's faster technology out there than the typical MapReduce,
Hadoop stack, that many engineers are becoming familiar with. Talk to me a little bit more
about the options there, and what you guys are looking at.
Dag: First of all, MapReduce is just a principle for how you can process big data: split it into smaller chunks, process those in parallel, and then try to reduce the results into something meaningful at the end. There are a lot of generalized MapReduce frameworks; the most well-known one is called Hadoop, which is basically a rather large stack of software founded on the principle of MapReduce.
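Since Dag is describing the principle rather than any particular framework, here's a minimal sketch of the idea itself; the log lines and field layout are made up for illustration:

```python
from collections import defaultdict

# Toy input: each record is "platform country count".
log_lines = ["iphone us 1", "android us 1", "iphone ca 1", "iphone us 1"]

def map_chunk(chunk):
    # Map: emit (platform, count) pairs from each record in the chunk.
    pairs = []
    for line in chunk:
        platform, _country, count = line.split()
        pairs.append((platform, int(count)))
    return pairs

def reduce_pairs(pairs):
    # Reduce: combine all partial counts per key into a final total.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

chunks = [log_lines[:2], log_lines[2:]]                     # split the data
mapped = [p for chunk in chunks for p in map_chunk(chunk)]  # map (parallelizable)
print(reduce_pairs(mapped))                                 # {'iphone': 3, 'android': 1}
```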
We started using Hadoop. It does have some strengths: it's open source and it's fairly easy to get started with. We've had some issues, running into bugs, and it's fairly difficult to get everything up and running. But it worked all right. What we were seeing, though, is that for the things we do, we need shorter response times.
We needed to do queries over really large data sets, and we just found that Hadoop didn't perform well enough for us. We were using Amazon's Elastic MapReduce, their cloud offering, which also works pretty well, but it just didn't give us the performance we wanted. The fundamental reason for this is probably, or at least to a large degree, that when you're using Hadoop, you're working with raw log files underneath.
If you have thirty terabytes of data, you actually have to read thirty terabytes off disk. Whereas very often you just want to pick out a couple of fields. Let's say one of the queries we might run over this data set is: over the course of a month, how many ad impressions could we sell to iPhone users in North America? All we're interested in there is the platform, which we detect, and maybe some IP ranges or DMA or something like that. Just a couple of fields.
Pete: So you're talking about selecting a value out of just one column, out of a data
set that might have many columns in the row.
Dag: Yeah, that's exactly it, right? Hadoop is brute-forcing a lot of these things. What you really want to do here, well, this is just the poster child for column-oriented databases. There are a few open source ones there, too: HBase is one, and Cassandra is probably one of the more well-known. There is an interesting project going on that the DataStax guys are doing, running Hadoop on top of Cassandra, which at least takes the storage and I/O load problem out of the equation. But it turns out that there are a lot of proprietary commercial vendors here that offer solutions that still just blow anything else out of the water.
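To make the row-versus-column distinction concrete, here is a toy sketch (invented data, not their schema) showing the same records in both layouts; the query only has to touch two of the four columns in the columnar layout:

```python
# Row-oriented layout: every record carries all of its fields, so a scan
# reads every field of every row (like reading 30 TB of raw logs).
rows = [
    {"platform": "iphone",  "country": "US", "dma": 501, "price": 1.2},
    {"platform": "android", "country": "US", "dma": 501, "price": 0.9},
    {"platform": "iphone",  "country": "CA", "dma": 0,   "price": 1.1},
]

# Column-oriented layout: one array per field; a query over platform and
# country reads just those two arrays and skips the rest entirely.
columns = {
    "platform": ["iphone", "android", "iphone"],
    "country":  ["US", "US", "CA"],
    "dma":      [501, 501, 0],
    "price":    [1.2, 0.9, 1.1],
}

# "How many impressions could we sell to iPhone users in North America?"
count = sum(
    1
    for platform, country in zip(columns["platform"], columns["country"])
    if platform == "iphone" and country in ("US", "CA")
)
print(count)  # 2, computed while touching only 2 of the 4 columns
```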
Some of them are actually hybrid open source as well, like Infobright, which is based on MySQL and has a proprietary storage engine. It has a community edition, but if you want the full solution you actually have to pay for it, at least for now. We're not using them. They use compression and column-oriented storage.
I recommend anybody that's looking for analytics, and especially real-time analytics, to take
a look at these offerings. I know that if we were trying to do this with something like
Hadoop, well, we would have made it work.
We could do roll-ups on a regular basis, all those kinds of traditional workarounds, but with these solutions you can actually just run your queries in SQL and they will respond in sub-second time, even over hundreds of millions of rows.
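As an illustration of the “just run your SQL” point, here's a sketch of that kind of query. sqlite3 is only a stand-in so the snippet runs; the schema and data are invented, and the real engines answer this over hundreds of millions of rows in sub-second time thanks to compression and columnar storage:

```python
import sqlite3

# In-memory stand-in for a columnar analytics store (Vertica, Infobright,
# etc.); table layout and values are purely illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE impressions (day TEXT, platform TEXT, country TEXT)")
db.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [("2011-06-01", "iphone", "US"),
     ("2011-06-02", "iphone", "CA"),
     ("2011-06-02", "android", "US")],
)

# Month of iPhone impressions in North America, no pre-computed roll-ups.
query = """
    SELECT COUNT(*) FROM impressions
    WHERE platform = 'iphone'
      AND country IN ('US', 'CA')
      AND day BETWEEN '2011-06-01' AND '2011-06-30'
"""
print(db.execute(query).fetchone()[0])  # 2
```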
Pete: And a couple of those players again are?
Dag: So the ones that we've been looking at are Vertica, Infobright, Aster Data, Greenplum, and another player called VectorWise, which is from Ingres.
So, they all perform incredibly well. They do start to differ a bit depending on what you're doing. We have two types of big data, basically. We have the impression tracking, which is per campaign; that's still in the region of maybe a couple hundred million rows that we do analytics on.
And then we have this firehose, which is in the thirty-billion-row region. When you start getting up to those levels, they start to differ a bit due to the way they're architected; some are easy to scale out, whereas others actually have to be scaled up. So they're different, but they're all worth a look.