Pete: So tell me about the next layer of your architecture. You're doing some interesting things on the data side, using NoSQL technologies among other things. Talk to me a little bit about that.
Dag: For every single bid request that comes in, we have to do some sort of storage look-up. Actually, we usually do two look-ups per request. So if we're doing 25,000 requests per second, we have to do about 50,000 look-ups per second.
We also store information about the requests, so we have fairly substantial write throughput as well.
This is one of those obvious situations where traditional, very general database technologies fall short. We could probably scale the reads pretty well up to that volume, but the writes would be very, very hard. And because we're so latency constrained, having a general database engine add two or three extra milliseconds is just a waste of time.
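To make the numbers concrete, here is a rough sketch of what such a bid path could look like. This is an assumption-laden illustration, not their code: plain dicts stand in for the networked key-value store, and the latency budget and look-up keys are invented.

```python
import time

# Hypothetical stand-ins for the bidding-path key-value store; the real
# system would hit a networked KV cluster, not in-process Python dicts.
user_profiles = {"u1": {"segments": ["sports"]}}
campaign_counters = {"c1": {"impressions": 0}}

LATENCY_BUDGET_MS = 2.0  # illustrative storage budget per bid request

def handle_bid_request(user_id, campaign_id):
    start = time.perf_counter()

    # Two look-ups per bid request: at 25,000 requests per second that
    # comes to roughly 50,000 look-ups per second against the store.
    profile = user_profiles.get(user_id)
    counters = campaign_counters.get(campaign_id)

    # The same path also writes (e.g. bumping counters), which is why
    # write throughput matters as much as read throughput.
    if counters is not None:
        counters["impressions"] += 1

    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # A general-purpose database adding 2-3 ms here would blow the budget.
    return profile is not None and elapsed_ms <= LATENCY_BUDGET_MS

print(handle_bid_request("u1", "c1"))
```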
Pete: So because of this, how many different types of data stores do you have?
Dag: We have three types, basically. We have the key-value store, which is used for reads and writes during the time-critical bidding process. Then we have a traditional RDBMS, in our case MySQL, that we use for the more business-oriented, standard things that are nice to put in a database.
So we have our ledgers and transaction logs for money flow; not for the high-frequency stuff, but for the more low-frequency stuff. And then we also have a big-data-type store, which we use for doing analytics and queries on the really, really big data sets.
We actually have a fourth as well: raw logs, unstructured, or actually structured, that we can run ad-hoc MapReduce queries over if everything else fails.
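A loose sketch of how events might be routed across those four store types. The names and the mapping here are illustrative guesses, not their actual code:

```python
# Hypothetical dispatch across the four store types described above.
STORES = {
    "kv":        "key-value store (time-critical bid-path reads/writes)",
    "rdbms":     "MySQL (ledgers and transaction logs, low-frequency)",
    "analytics": "big-data store (queries over very large data sets)",
    "raw_logs":  "raw logs (ad-hoc MapReduce fallback)",
}

def route(event_type):
    # Sketch of the described split; unknown events fall back to raw logs.
    return {
        "bid_lookup":   "kv",
        "money_ledger": "rdbms",
        "impression":   "analytics",
    }.get(event_type, "raw_logs")

print(route("money_ledger"), "->", STORES[route("money_ledger")])
```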
Pete: And I think we spoke previously; you're looking at some different solutions for NoSQL data stores. Talk to me a little about the options there.
Dag: Yeah, there are a lot of options for NoSQL right now. First of all, I like the new explanation for NoSQL, which is “not only SQL”. This is all about finding the right tool for the job, and the different NoSQL solutions are usually very specialized.
There are some fairly general ones, like Mongo and CouchDB, and they have their sweet spots, but in our case we were looking for something very, very simple.
We actually just have keys and values; that's all we need. Something with extremely low latency, something that performs consistently and predictably. We looked at a number of different projects, including open source projects, and we actually started out running on open source projects. I'm not going to bash them right now, but we ended up going with a commercial vendor. Which I'm not going to sell either right now.
Well, yeah, they're called Citrusleaf. It's a really good company, actually; they deserve a lot of credit. They took a lot of the jitter in our performance right out of the equation. It's a very, very simple yet very capable NoSQL solution that performs insanely well. This is the stuff that we're running on SSD drives in a fairly small cluster, but it can still take immense load; the throughput is awesome.
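The jitter point is worth unpacking: on a bidding path it's the tail latency, not the average, that hurts. As a rough illustration with simulated numbers (not their measurements), two stores can share a median yet differ wildly at the 99th percentile:

```python
import random
import statistics

def percentile(samples, p):
    # Simple nearest-rank percentile over a list of samples.
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p / 100.0 * len(ordered)))
    return ordered[idx]

# Simulated per-look-up latencies in milliseconds for two hypothetical stores.
steady = [random.gauss(0.8, 0.05) for _ in range(10_000)]
jittery = [random.gauss(0.8, 0.05) if random.random() > 0.01
           else random.uniform(5, 50)  # occasional GC-style pause
           for _ in range(10_000)]

for name, data in (("steady", steady), ("jittery", jittery)):
    print(name,
          "median=%.2f ms" % statistics.median(data),
          "p99=%.2f ms" % percentile(data, 99))
```

Both stores look identical on the median; only the p99 reveals the jitter that would blow a per-request latency budget.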
Pete: So you mentioned that there's faster technology out there than the typical MapReduce,
Hadoop stack, that many engineers are becoming familiar with. Talk to me a little bit more
about the options there, and what you guys are looking at.
Dag: First of all, MapReduce is just a principle for how you can process big data: split it into smaller chunks, process those in parallel, and then try to reduce the results into something meaningful at the end. There are a lot of generalized MapReduce frameworks; the most well-known one is called Hadoop, which is basically a rather large stack of software founded on the principle of MapReduce.
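Since Dag is describing the principle rather than any particular framework, here's a minimal sketch of the idea itself; the log lines and field layout are made up for illustration:

```python
from collections import defaultdict

# Toy input: each record is "platform country count".
log_lines = ["iphone us 1", "android us 1", "iphone ca 1", "iphone us 1"]

def map_chunk(chunk):
    # Map: emit (platform, count) pairs from each record in the chunk.
    pairs = []
    for line in chunk:
        platform, _country, count = line.split()
        pairs.append((platform, int(count)))
    return pairs

def reduce_pairs(pairs):
    # Reduce: combine all partial counts per key into a final total.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

chunks = [log_lines[:2], log_lines[2:]]                     # split the data
mapped = [p for chunk in chunks for p in map_chunk(chunk)]  # map (parallelizable)
print(reduce_pairs(mapped))                                 # {'iphone': 3, 'android': 1}
```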
We started using Hadoop. It does have some strengths: it's open source and it's fairly easy to get started with. We've had some issues, running into bugs, and it's fairly difficult to get everything up and running. But it worked all right. What we were seeing, though, is that for the things we do, we need shorter response times.
We needed to do queries over really large data sets, and we just found that Hadoop didn't perform well enough for us. We were using Amazon's Elastic MapReduce, their cloud offering, which also works pretty well, but it just didn't give us the performance we wanted. The fundamental reason for this is probably, or at least to a large degree, that when you're using Hadoop, you're working with raw log files underneath.
If you have thirty terabytes of data, you actually have to read thirty terabytes off disk. Whereas very often you just want to pick out a couple of fields. Let's say one of the queries we might run over this data set is: over the course of a month, how many ad impressions could we sell to iPhone users in North America? All we're interested in there is the platform, which we detect, and maybe some IP ranges or DMA or something like that. Just a couple of fields.
Pete: So you're talking about selecting a value out of just one column, out of a data
set that might have many columns in the row.
Dag: Yeah, that's exactly it, right? Hadoop is brute-forcing a lot of these things. What you really want to do here, well, this is just the poster child for column-oriented databases. There are a few open source ones there, too: HBase is one, and Cassandra is probably one of the more well-known. There is an interesting project going on that the DataStax guys are doing, running Hadoop on top of Cassandra, which at least takes the storage and I/O load problem out of the equation. But it turns out that there are a lot of proprietary commercial vendors here that offer solutions that still just blow anything else out of the water.
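To make the row-versus-column distinction concrete, here is a toy sketch (invented data, not their schema) showing the same records in both layouts; the query only has to touch two of the four columns in the columnar layout:

```python
# Row-oriented layout: every record carries all of its fields, so a scan
# reads every field of every row (like reading 30 TB of raw logs).
rows = [
    {"platform": "iphone",  "country": "US", "dma": 501, "price": 1.2},
    {"platform": "android", "country": "US", "dma": 501, "price": 0.9},
    {"platform": "iphone",  "country": "CA", "dma": 0,   "price": 1.1},
]

# Column-oriented layout: one array per field; a query over platform and
# country reads just those two arrays and skips the rest entirely.
columns = {
    "platform": ["iphone", "android", "iphone"],
    "country":  ["US", "US", "CA"],
    "dma":      [501, 501, 0],
    "price":    [1.2, 0.9, 1.1],
}

# "How many impressions could we sell to iPhone users in North America?"
count = sum(
    1
    for platform, country in zip(columns["platform"], columns["country"])
    if platform == "iphone" and country in ("US", "CA")
)
print(count)  # 2, computed while touching only 2 of the 4 columns
```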
Some of them are actually hybrid open source as well, like Infobright, which is based on MySQL and has a proprietary storage engine. It has a community edition, but if you want the full solution you actually have to pay for it, at least for now. We're not using them. They use compression and column-oriented storage.
I recommend anybody that's looking for analytics, and especially real-time analytics, to take
a look at these offerings. I know that if we were trying to do this with something like
Hadoop, well, we would have made it work.
We could do roll-ups on a regular basis, all those kinds of traditional workarounds, but with these solutions you can actually just run your queries in SQL and they will respond in sub-second time, even over hundreds of millions of rows.
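As an illustration of the “just run your SQL” point, here's a sketch of that kind of query. sqlite3 is only a stand-in so the snippet runs; the schema and data are invented, and the real engines answer this over hundreds of millions of rows in sub-second time thanks to compression and columnar storage:

```python
import sqlite3

# In-memory stand-in for a columnar analytics store (Vertica, Infobright,
# etc.); table layout and values are purely illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE impressions (day TEXT, platform TEXT, country TEXT)")
db.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [("2011-06-01", "iphone", "US"),
     ("2011-06-02", "iphone", "CA"),
     ("2011-06-02", "android", "US")],
)

# Month of iPhone impressions in North America, no pre-computed roll-ups.
query = """
    SELECT COUNT(*) FROM impressions
    WHERE platform = 'iphone'
      AND country IN ('US', 'CA')
      AND day BETWEEN '2011-06-01' AND '2011-06-30'
"""
print(db.execute(query).fetchone()[0])  # 2
```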
Pete: And a couple of those players again are?
Dag: So the ones that we've been looking at are Vertica, Infobright, Aster Data, Greenplum, and another player called VectorWise, which is from Ingres.
So, they all perform incredibly well. They do start to differ a bit depending on what you're doing. We have two types of big data, basically. We have the impression tracking, which is per campaign; that's still in the region of maybe a couple hundred million rows that we do analytics on.
And then we have this firehose, which is in the thirty-billion-row region. When you start getting up to those levels, they start to differ a bit due to the way they're architected; some are easy to scale out, whereas others actually have to be scaled up. So they're different, but they're all worth a look.