Ad:Tech NYC CTO Roundtable

Srini Srinivasan/Moderator: We have here a distinguished panel of key technical leaders from several
real-time advertising companies. The panel
members really need no introduction, but I'm going to request them to give a brief introduction
for themselves, starting with Mike from AppNexus. Mike Nolet: I'm Mike Nolet, CTO/co-founder
at AppNexus. What we do is we sell technology to companies, which helps them run real-time
businesses. We work with real-time sellers and real-time buyers: ad exchanges, SSPs,
DSPs, ad networks, all the various different kinds of entities in the space. Our major
customers are Microsoft Ad Exchange, for example on the sell side, Orange, now Interactive
Media, which is Orange Telecom in Germany, and then on the buy side we have major companies
like Collective, and also eBay, that use us to buy on a real-time basis. I'd ask the audience
to spot who up here does not fit. There is one person wearing a jacket. And he works
for Aerospike, no? Srini Srinivasan/Moderator: Let me interject for a second. The person
from BlueKai was unable to attend, so I invited the CTO of Aerospike, Brian Bulkowski, to
join us. He's the odd man out, so Mike's right. Mike Yudin: Hello. My name is Mike
Yudin, I'm the CTO at AdMarketplace. We're an advertising technology company based right
here in New York, and we operate the largest search network outside Google and Yahoo.
We work with some of the best and largest internet brands, delivering performance pay-per-click
traffic to advertisers worldwide, and we remain the eighth fastest-growing private
company in New York. We have a lot of traffic, just like everyone here on this panel, and
we solve complex advertising problems in real time using the data that comes our way.
Dag Liodden: I'm Dag Liodden, I'm the CTO of Tapad. Tapad is a fairly young technology
company, we help advertisers reach audiences across their multiple screens. If you're a
user in this day and age, you probably have tablets, you have iPhones, you have laptops,
and what we try to do is we help advertisers target, and measure performance across multiple
devices. Pat DeAngelis: I'm Pat DeAngelis, I'm the Chief Technology Officer for [x+1]
Solutions. We're a digital marketing hub, much more on the advertiser's side. We enable
cross-channel analytics and optimization across multiple touch points, typically for enterprise
clients. Typically we would do site optimization, and we have a real-time bidding DSP, if you will,
using AppNexus on the sell side. We are actually a partner of AppNexus. Our clients include
J.P. Morgan Chase, Capital One, Fingerhut, FedEx, Delta, and some of the largest brands
on the Internet. Brian Bulkowski: I'm Brian Bulkowski, from Aerospike. And I'm filling
in for the BlueKai gentleman. I'm a co-founder, along with Srini Srinivasan, and one of the inventors
of our technology and database. Srini Srinivasan/Moderator: Thank you. In terms of how we will do the
panel, what I am going to do is to kick it off with a few questions myself to the panelists,
and given the fact that the room is small, we can be a little bit more interactive. So
any of the audience members here, if you find that the discussion is either too technical
or not technical enough, feel free to raise your hand. Or for that matter, if you have
questions after the first few questions are answered, I'll throw it open to more questions
from the audience. Thank you. So let's get started. Real-time big data has been used
in several critical aspects of the advertising business. We at Aerospike have been fortunate
to have a front row seat to witness the evolution of this real-time technology through the
experiences of our customers. This panel is an attempt by us to bring together some of
the foremost experts in the area so that other people can learn about this evolution and
participate in this session. These companies in real-time advertising use big data in very
interesting ways. For example, they use it at the ad server level, where they deal with
millisecond SLAs day in and day out. They also deal with analyzing the data and then
feeding new models in on a periodic basis, maybe every hour, every day, and so on. My
initial question to the panelists is essentially this:
What are one or two instances where real-time data processing has had a tremendous impact
in a positive way on your business over the last year or two? I'll start it off addressing
this to Mike from AppNexus. Mike Nolet: On our side, I don't know if it's
a positive impact so much as it's something without which we can't operate our business.
When we do real-time buying, which is obviously something we do, we listen to every single
real-time ad that's available. We power a fair number of them, and that's a whole bunch.
We see in our peak day I think 39.5 billion ads in one individual day. It's about 600,000-odd
requests per second. Every second -- I think right now is about peak time -- every second
we are bidding on 600,000 ads. And the reality of that is, a lot of that buying is being
driven by behavior, and so we must have cookie data server-side. For us it's not a choice.
We have to have server-side data storage. We have to be able to do 600,000 read requests
a second. Now we deliver about 150,000-170,000 ads a second, so every time we win an auction
we do write updates to our cookie store. And so for us, we don't have a choice, right?
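The read-on-bid, write-on-win pattern Mike describes can be sketched roughly as below. This is a toy illustration, not AppNexus code: the bin names, price, and the plain dict standing in for a networked key-value store are all invented.

```python
# Sketch of a bid-time cookie-store lookup: read the user's profile on
# every bid request, and write back only when we win the auction.
# A plain dict stands in for a networked key-value store like Aerospike.

cookie_store = {}  # user_id -> profile bins

def on_bid_request(user_id, ad):
    """Read path: runs for every request (600K/sec in Mike's numbers)."""
    profile = cookie_store.get(user_id, {"segments": [], "ads_seen": 0})
    # Bid only if the stored behavioral data suggests the ad is relevant.
    if ad["segment"] in profile["segments"]:
        return {"bid": True, "price_cpm": 1.50}
    return {"bid": False}

def on_auction_won(user_id):
    """Write path: runs only on wins (~150K/sec), updating frequency data."""
    profile = cookie_store.setdefault(user_id, {"segments": [], "ads_seen": 0})
    profile["ads_seen"] += 1

cookie_store["u1"] = {"segments": ["auto"], "ads_seen": 3}
print(on_bid_request("u1", {"segment": "auto"}))  # bids
on_auction_won("u1")
print(cookie_store["u1"]["ads_seen"])
```

The asymmetry is the point: reads happen on every auction, writes only on wins, so the store must be tuned for a read-heavy mix at sub-millisecond latency.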
For us, we actually work with Aerospike, and we were - were we the first or second customer?
First. So back when it was Brian and Srini in a coffee shop in San Francisco, and they
were two guys and they said, "Yeah, we can do this for you." We didn't really have a
choice, because we were working with another vendor that was just truly terrible, who will
remain unnamed. And so we actually trusted them with this, and we've had a fantastic
ride over the last three and a half years as we started at, I think, 10,000 qps and
climbing to 600,000 qps. We probably found every bug for them along the way, so it's all good for
all of you if you want to work with it. For us, basically, a real-time key value store has
enabled all of our real-time buying business. Also it's a platform for the ecosystem, so
what's really exciting is -- I don't know if you guys know this, but on AppNexus you can
effectively use our key value store, you can use our infrastructure and our data centers
and actually put your own data in there and use that. It's really enabled us to provide
fantastic technologies and offerings to our customers. Mike Yudin: AdMarketplace is a
search syndication network. What that means is we look at each request for ads, of which
we get about a billion a day, or 50,000 per second, in three dimensions. We'll look at
the traffic source where an ad is going to be displayed, we'll look at the user who's
going to see the ad, and we'll look at targeting information such as the keyword that the user
typed in the search box. When a request gets into our system, we look at these three dimensions.
We have to pull data in one or two milliseconds on all of them; we have to know as much as
possible about the user, as much as possible about the traffic source, and match this based
on the keyword with all the ads that our system bid on. And then make a prediction, essentially,
as to what would be a fair price per click for each ad that matched this request. And
we return ads, for all 50,000 requests per second. Just like Mike said, there's no
question: the real-time data store isn't some supplementary nice-to-have, it's a
necessity. Without that, we wouldn't be where we are. All the competitive advantage for
a company like ours, for everyone on this panel, is in the data and how you use that data.
The more data you have and the better access to this data you have in real time, then you
can make very intelligent decisions and you can have sophisticated offline processing.
But all the modeling happens after the fact. At its core, this is a very simple business.
You get in a request, look at the data, return ads within ten milliseconds,
and then you see what happens. That's a constant, never-ending cycle. Audience member: [??] [10:06]
Mike Yudin: Well, a success story? Sure, very simple. One of our advertisers, Volvo Cars, started
an ad campaign with us. Their goal was to actually get people into a Volvo dealership
to drive Volvos. So we started a very broad ad campaign. We'd never had Volvos advertised
in our system before. So how do you know if a person who is about to see this ad will
have any interest in driving a Volvo? So what you do is, you look at past history of these
users. You see, has this person searched for things like new car prices, or test drive.
If you see the user with that type of search, in the history in the last week or so, you
can probably know they're in the market for a car, especially if the request comes from
a relevant source, like a car blog or something like that. It doubles your chances. So we
use all this data. We successfully execute it on the campaign, and I think Volvo reported
that the ROI was actually higher than their search engine buy on Google. You can find
this on our website. So there are many stories like this. And this would not be possible
without data. Otherwise, you'd just be spraying and praying as they say. So we don't do that.
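The kind of signal Mike describes -- checking a user's recent search history for in-market terms, with a contextual source doubling the chances -- might be scored like the toy function below. Every term, weight, and threshold here is invented for illustration.

```python
import time

# Hypothetical in-market search terms for the automotive vertical.
INTENT_TERMS = {"new car prices", "test drive", "car dealership"}
WEEK = 7 * 24 * 3600  # "in the history in the last week or so"

def in_market_score(search_history, source_category, now=None):
    """search_history: list of (timestamp, query) pairs.
    Returns a crude 0..1 score for 'in the market for a car'."""
    now = now or time.time()
    # Keep only the queries from roughly the last week.
    recent = [q for ts, q in search_history if now - ts < WEEK]
    score = 0.4 if any(q in INTENT_TERMS for q in recent) else 0.0
    if source_category == "car blog":  # contextual match doubles the signal
        score *= 2
    return min(score, 1.0)

now = time.time()
history = [(now - 2 * 24 * 3600, "test drive"),      # 2 days ago: counts
           (now - 30 * 24 * 3600, "new car prices")]  # 30 days ago: stale
print(in_market_score(history, "car blog", now))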
[11:40] Dag: So, our business is similar to what Mike said. We also do real-time buying.
What makes our setup a little bit different from all of the other places is that we're
not just looking at individual devices, we're also looking at how they're connected to other
devices. And we want to try to use that for targeting, but also for attribution reporting,
so after something happens, so if someone actually goes and buys something, we want
to see which of the devices were involved with this chain that led up to this purchase.
What Aerospike has enabled us to do is not do all these things after the fact. Traditionally
in this space, you often do log shipping, and then you go through these logs afterwards.
You sift through, and try to find patterns and you kind of do an offline batch processing
of these things. What Aerospike has enabled us to do is that we can keep this entire data
set, which we call the device graph, which has data about all the devices we see and
also the connections between them. We can actually query that data in real time. If
someone goes and buys or signs up for Netflix (they're not a customer of ours by the way)
if someone buys a service, we can go into our graph immediately, start with that device
and see which other devices are related to it. And we can pull that in real time, instead
of having to sift through these logs on some big distributed file system like Hadoop, and
then run a heavy job that maybe comes up with a result 24 hours later. We can do
all these things in real time. We can call up the partners, say, a second ago someone
signed up and the cross device impression history of this subset of the graph looks
like this. So basically we have access to our entire data set, and any subset of it, in real-time
response times at all times. Pat: We also at [x+1] have a demand-side platform, which does real-time
bidding. A lot of the stories I'm hearing from my colleagues here, I can echo the same
sentiment. If you can't make a decision based on as much data as possible, in a few milliseconds,
you're pretty much toast in that business. Where [x+1] is a little bit different, we
also do a lot of onsite optimization. What that is, basically as an example, you go to
the FedEx home page, you're going to get a bunch of offers, so we're pretty down in the
marketing funnel; you go to FedEx's website, home page, their offers that they're going
to provide are powered by [x+1], 100%. So these are pretty high value transactions.
The way we do that is typically through a predictive model. We build predictive models
continuously. We have all our processes that run and tweak these models, and basically
these models execute in real time. So one side of the equation for us is to make sure
that we can execute those models, as quickly as possible. And the other side is to make
sure we have that vector of data on that user so we can provide the best offer. Otherwise,
we can't optimize their experience, and we don't get paid. So I would say what's really
helped, and the success story for Aerospike is, we can now onboard anywhere from 5 to
10,000 attributes for a user, put that up in our data store, and slide through those
vectors with our model all day long. We can chain models together, meaning we can execute
a model. If that results in fetching some more data and executing another model, we
can certainly do that well now. Data can come from offline. We can get a file from somewhere
like Acxiom, with 500 attributes, we load it up and in the next few seconds, that person
goes to anywhere on, let's say, Chase's home page, their mobile app, what have you, we
have that data. We can execute that model, we can optimize. That just wasn't possible
to the same scale before we went to Aerospike. [16:05] Moderator/Srini Srinivasan: It's clear
that performing at really high levels is necessary for running real-time advertising businesses.
The other side of the coin is to achieve 100% uptime. With the kind of weather issues
we've been having lately on the East coast, some of our customers have actually dealt
with these issues. I'm just going to go to Mike Yudin, and request him to talk about it.
[laughter] Okay, I'm going to give everybody a chance, but I'll start with you to talk
about how hard it is to achieve 100% up-time, and how it is so important in your business.
Mike Yudin: Okay. Well, thank you, Srini. You have to remind me about the most depressing
week of my life. Moderator/Srini Srinivasan: I'm sorry. Mike Yudin: But it's all good.
We do 100% uptime. We lost one of our data centers in the flood, and it's not just the
data center itself that lost power, it's the entire network infrastructure of the tri-state
area, all the major backbones. The Verizons and Sprints of the world lost connectivity.
And yet we stayed up and didn't lose a bit of data. How did we do this? We do this
by having not only redundant equipment within the data center, but also a globally
load-balanced infrastructure across multiple locations. If one gets flooded, then traffic
just gets shifted to the data center that survives. The trick here of course is to make
sure that your surviving location has all the same data and all the same intelligence as the system
that got destroyed. There are several techniques in this, and one is cross-data center replication
of data. So this is one of the reasons why we chose Aerospike. They have this ability,
so our data centers exchange data between each other through their XDR cross-data center
replication process. That works quite well, and it's fast. If a user is in Chicago, and
they do a search for a new car in Chicago, and it hits the Chicago data center, in less
than a second this information propagates to the New York data center. If the same
user then goes to another website, and a disaster happens, and then his
next request arrives at a different location, all the data is available. Of course you have
to plan, and you have to have a disaster recovery plan in place, and then you have to have a
plan in place for what you do after everything goes back to normal. That's what we're dealing
with today. And you also have to make sure you choose a nice and sunny location for your
disaster recovery office. I spent this... Pat: And then you get earthquakes... [laughter]
Mike Y: where there are no earthquakes. I could have gone to the south of France in
the same amount of time it took me to get to Pittsburgh, Pennsylvania. So that's my
story. [19:37] Srini/Moderator: Any of the other panelists want to add their thoughts
to this? [19:45] Mike Nolet: I think the one thing I'll tell you we do is, first, redundancy
and data replication. We were talking about this before the panel: you have to have multiple
locations. But in advertising, multiple locations alone are not enough; you also have to understand,
within each of the locations, how you're connected to the Internet, how you're connected
to different partners -- and to your point around network infrastructure, many of the ISPs had
major, major flooding inside their hubs. In our facility, we're lucky enough not to lose
power, so we stayed up. But we saw that half of our network providers lost power. We lost
cross connects to Google, to Amazon... Mike Yudin: Well that's what happened to us, we
were up, but all of a sudden our traffic dropped 70% because no requests were coming through.
So how good is that? Mike Nolet: It's true. 111 Eighth Avenue is one of the largest buildings
in Manhattan, and it went down basically for eight hours. And this is actually where we
meet up with Microsoft and Amazon, with Google and all these major companies. Our network
team was actually out for seven days straight, playing Whack-a-mole, routing traffic, trying
to make it work. And then the one last thing that I'd say that we do that actually helps
a lot with the data, we actually -- well, we don't use the Aerospike replication, we
wrote our own replication layer, and what we do is we replicate incremental changes
to all of our user data. One thing we track a lot is how many ads you've seen, right?
And of course behavior. The problem is -- let's say I have your resume, right? And I have
that in your key value store. As I change the resume both in LA and New York, how do
you make sure that both copies of the resume get the exact same change? One way is to send
the whole resume across, but then you end up, actually, if you have conflicting changes,
you can end up losing data in the middle. So what we always say is that we're adding
a line to the resume, at this location, and we might do that multiple times on multiple
records on the same user. Then even if we lose connectivity, or something weird happens,
whenever connectivity comes back we stream those messages back and forth to each other
so that in the end those two copies end up being the same again. I think that's one technique
that was used very successfully to make sure that our user data in the end always ends
up being 100% the same across all of our locations. [21:55] Srini/Moderator: I'd like
to add a follow-up for Brian. For example, AppNexus was our first customer, so our
cross-data center support only showed up last year. So I'm asking Brian: is Aerospike going
to solve these problems? Brian: Yeah, but I wanted to share another customer's story.
Our first deployment of cross-data center replication was with a company many of you probably know:
Exelate. Exelate has three different data pools, including a US data pool and a global data pool,
all of them replicating among four data centers. What happened to them was they lost one of
their New York data centers as well. Everything backed up on the servers in the data centers
feeding New York was fine, and when the data center came back, it ended up re-replicating.
What they did say that was interesting was they actually had to call us, because when
their New York office went down, they had IP-based security into their data center, and
they had abandoned their office. From home, they couldn't actually
get into their data centers to do a graceful shutdown of the servers. So they had to call
our support guys and say, "Hey, Kavin, can you please take down these servers gracefully,
because we just got notice that there's only 30 minutes of fuel left." We were happy to
oblige and help them take down their servers gracefully. That's the kind of thing we do.
Only a few of our customers lost full data centers. As you say, connectivity was really
the issue. In terms of being able to support this at our layer -- what we call delta
shipping, which ships just the updates, which is basically the technique Mike was talking about -- we'll
be having that probably in the next six months or so. It's an important technique, both for
bandwidth reduction as well as for getting the correct data and not losing updates. [21:57]
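The delta-shipping idea the panelists describe -- replicating operations ("append this line to the resume") rather than whole records, so replicas converge even after a partition -- can be sketched as below. This is a toy model, not Aerospike's or AppNexus' implementation; the ops here are commutative set-adds, which sidesteps the conflict-resolution machinery a real system needs.

```python
class Replica:
    """Each data center keeps an op log and replays peers' ops.
    Because the ops are commutative appends (set adds), replay order
    between reconnects doesn't matter and replicas converge."""
    def __init__(self, name):
        self.name = name
        self.data = {}    # user_id -> set of events
        self.oplog = []   # ops originated locally, pending shipment

    def append(self, user_id, event):
        """Record an incremental change, not the whole record."""
        op = (self.name, user_id, event)
        self._apply(op)
        self.oplog.append(op)

    def _apply(self, op):
        _, user_id, event = op
        self.data.setdefault(user_id, set()).add(event)

    def sync_from(self, peer):
        """Called when connectivity returns: replay the peer's ops."""
        for op in peer.oplog:
            self._apply(op)

ny, la = Replica("NY"), Replica("LA")
ny.append("u1", "saw_ad_A")   # written while the cross-DC link is down
la.append("u1", "saw_ad_B")
ny.sync_from(la); la.sync_from(ny)
print(ny.data == la.data)  # both copies end up identical -> True
```

Shipping whole records instead would let concurrent writers overwrite each other's changes, which is exactly the lost-update problem Mike's resume analogy describes.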
Srini/Moderator: Okay. Mike, did you want to say something? Dag: Cross-data center
replication and redundancy are important, but of course you also need to have intra-data
center redundancy, and this is where this product also does a really good job. If you
lose a node, the routing is built into the clients and into the servers, so they will route traffic
to wherever the data is. If you add nodes, you don't have to plan for it; you can just
add nodes and they will automatically start migrating data from the other servers. Data
centers do fail, fortunately they don't fail that frequently. Servers, they fail pretty
frequently, or they don't even fail -- someone just takes them down by mistake. I think that
happens quite a bit as well. We've seen that happen a few times. Fortunately it hasn't
really affected us. [24:47] Srini/Moderator: There was a hand raised in the audience. Is
there a question from the audience? Audience member: I got the answer... Srini/Moderator:
Okay, that's great. Please do... Audience member: I have a question. Srini/Moderator:
Please. Audience member: Can some of you comment on what lessons have been learned from scaling
systems, something that's not obvious? For example, there was something from Mike I believe
[?]. What advice can you give? What did you learn from scaling these systems? [21:51]
Mike N: That's a broad question. I'll give you two answers to that. Specifically, two
lessons learned. I think for us, we found anything that's not simple will fail. And
we're an outlier in terms of throughput and volume, because we're just running on so many
servers, serving so many ads every single second. The simpler the architecture, the
fewer points of failure, hands down is the best. Which is actually why we threw out all
our load balancers -- because at some point we found that load balancers start dying at
600,000 qps. We found that we had more outages due to load balancer issues than we had due
to applications or software issues. We ended up effectively embedding load balancing into
our direct applications, and magically things got a lot better. I think simplicity is just,
hands down, hands down, where you actually want to be. And then the second one is full,
end-to-end automation of everything you do, like your example about manual error, right?
People fail. People naturally fail. They'll fail 80% of the time on the average day. This
is not a problem; this is not a bad person. Your good engineer will make mistakes all
the time. Now, at two in the morning, you make mistakes most of the time. Anything that's
not automated will have production issues. It's really that simple. If you can't point-and-click
deploy, if you can't point-and-click roll back, if you can't point-and-click debug,
you'll simply have production issues, because at two in the morning someone fat-fingers
and types the wrong thing, and "Oops. ***. I did something wrong." And so, you want to
keep it incredibly simple, and then automate the snot out of it, is what our head of tech
ops actually says. You look at where we have production issues, we're not perfect, and
the only place where we have serious issues is where we do not have full automation in
place. Where we can't just pull up a new server, or fail over a data center, or anything like
this. And our stack, it's almost all home grown. We use Aerospike for key value stores,
we use Vertica for our reporting databases, and then the rest of this -- I
don't know if you'd call Hadoop off-the-shelf, because you have to do so much engineering
around Hadoop to make it work. And then everything else is home grown software for us. [27:35]
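Throwing out dedicated load balancers and embedding that logic in the application, as Mike describes, often comes down to the client keeping its own backend list and rotating through healthy hosts. A minimal sketch, with placeholder hostnames and an injected health probe standing in for real health checks:

```python
import itertools

class ClientSideBalancer:
    """Round-robin over backends with a simple health check, so the
    application routes its own traffic instead of relying on a
    hardware load balancer sitting in the request path."""
    def __init__(self, backends, is_healthy):
        self.backends = backends
        self.is_healthy = is_healthy          # injected health probe
        self._cycle = itertools.cycle(backends)

    def pick(self):
        # Try each backend at most once per pick.
        for _ in range(len(self.backends)):
            host = next(self._cycle)
            if self.is_healthy(host):
                return host
        raise RuntimeError("no healthy backends")

down = {"bidder2"}  # simulate one unhealthy host
lb = ClientSideBalancer(["bidder1", "bidder2", "bidder3"],
                        is_healthy=lambda h: h not in down)
picks = [lb.pick() for _ in range(4)]
print(picks)
```

This also illustrates the simplicity argument: one fewer appliance in the hot path is one fewer thing that can die at 600,000 qps.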
Mike Yudin: So I could probably add to this. Mike is saying, keep your systems simple,
and that's key to this. The way you keep it simple is you have to be very smart in dividing
your intelligence into online and offline. You do all the heavy lifting -- predictive
modeling, all the crazy algorithms -- offline. What you program into your real-time system is
just really quick lookups. Another piece of advice is to keep your system asynchronous. Because
as soon as you have components depending on other components, depending on other components,
and everything waits for the other thing to respond, and then everything is fine, everything
works just fine, but then one little thing is going to fail and there's going to be a
cascading effect and everything is going to come to a crawl, and you just have an avalanche
of degradation through the system. So you have to have a graceful degradation policy, and
you have to have asynchronicity as much as possible. We just had a discussion right before
we started, about blocking threads, and these kinds of things. That's the key to this. As
far as technology stack, and we all were just discussing this, pretty much anything works.
These days, hardware is powerful. Use a proven platform, whatever it is, C, Java, .Net, they
all work. Probably not a good idea to program your real-time ad server on Ruby on Rails,
but other than that it hasn't been a problem. Audience member: [??] Dag: What's the question?
Audience member: What's beyond Hadoop? [29:20] Mike Yudin: What's beyond Hadoop? I'm not
going to tell you, because we don't use Hadoop. [laughter] We don't use Hadoop that much.
We found it's too slow for us. The processing cycle for data in Hadoop is just... Audience
member: [??] Mike Yudin: Well, the main principle of Hadoop is a distributed, kind of grid
computing system. That's not going to go anywhere. You have to do that. People are trying all
kind of things. We ended up writing our own proprietary system. Whether that's going to
become the Internet standard or not, I don't know. But, probably like some of my colleagues,
we found that very, very few commercial or even open source solutions support this key
element, so we ended up programming a lot of this ourselves. It's kind of tragic, but
that's what it is. [30:10] Dag: I would say, in terms of lessons learned, metrics. Keep
metrics of everything, because scale kind of creeps up on you. You start seeing your
latencies jitter, and you want to correlate it. You would try to figure out what's going
on. If you're building an adtech system, you have a lot of moving parts. You have a lot
of endpoints that get hit. These have different impacts on your system, and if you don't have
metrics, you're pretty much blind. Scale and web traffic shift and grow from month to month.
We're hooked into AppNexus, and they get more traffic all the time, and then one day you all
of a sudden realize, "Oh, crap, we're over." Things slide just a little and then
we're over 100K qps, for instance, and we see that it's this kind of traffic that causes
this kind of ripple. For any sort of debugging, if you have any sort of performance regression,
looking at the metrics is the number one thing we use for debugging. You can't
live debug and step through code in production, and sometimes the only way you can debug some
things is by actually hitting them with real traffic, unless you have unlimited resources.
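The metrics habit Dag describes -- recording latency per endpoint so jitter can be traced to the traffic causing it -- can start as small as the sketch below. The endpoint name, sample values, and the naive in-memory percentile are all illustrative; production systems use streaming histograms instead of storing every sample.

```python
from collections import defaultdict

class LatencyMetrics:
    """Record per-endpoint latencies and report percentiles, so a
    jitter in p99 can be correlated with the endpoint causing it."""
    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, endpoint, millis):
        self.samples[endpoint].append(millis)

    def percentile(self, endpoint, pct):
        # Naive nearest-rank percentile over all recorded samples.
        data = sorted(self.samples[endpoint])
        idx = min(len(data) - 1, int(len(data) * pct / 100))
        return data[idx]

m = LatencyMetrics()
for ms in [2, 3, 2, 4, 50]:   # one slow outlier on the bid endpoint
    m.record("/bid", ms)
print(m.percentile("/bid", 50), m.percentile("/bid", 99))
```

The median looks healthy while the p99 exposes the outlier, which is why percentile metrics, not averages, are what you correlate against traffic changes.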
But metrics is what I would say. And beyond Hadoop, [laughs] a more efficient file storage
on the [??-31:35], I think. That'll do a lot. Having flat files for everything is... Audience
member: From the standpoint that Hadoop is not the answer, I already know that, what's
the future? Mike N: Why do you say Hadoop's not the answer? Audience member: Well, I...
Dag: I don't think -- Hadoop is a lot of things... Srini/Moderator: I think we're getting a little
bit off the topic. This is about real-time big data. Hadoop is not real-time in any way,
shape or form, as far as I understand. Audience member: I understand, a lot of people try
to write plug-ins, and... Srini/Moderator: Yeah, it is a way to -- for example, how Memcache
speeds up MySQL is the same philosophy there. But then things like Aerospike, and a whole
bunch of other NoSQL databases, they try to actually do a database at the speed of a cache.
So beyond Hadoop, you know? If you ask me, I'm biased. I'd say Aerospike. [laughter]
[32:30] Mike N: I'd love to make two points. One, you only have a certain amount of tools
in your toolbox, right? Hadoop is one of those tools. And if you try to screw a screw with
a hammer, it doesn't work very well. Just like if you try to use a screwdriver to hammer
in a nail, it doesn't work very well. Hadoop is the right tool for some things, Aerospike
is a fantastic tool for some things, and Vertica's a fantastic tool for other things. The problem
people make with all of these, including key value stores, is that they try to smoosh in too
much functionality, make it too general purpose, and try to make this super-fancy multi-tool
that does everything. And it turns out to be pretty mediocre at everything. When I get
pitches from vendors, a lot of times if they sell me too much -- there are a lot of these
people working on distributed real-time MySQL systems. One of these guys came in, and pitched
to us that "We can be your key value store." I actually rejected it outright, simply because
I don't want a key value store that does SQL and joins. It means it's a multi-tool, which
you just know is going to have some kind of complicated performance issues. It's just
too complex for what I want. I hope that Brian and Srini don't try to turn Aerospike into a
multi-tool, and that they understand what they're really, really good at, which is serving
key value, NoSQL-based data at incredibly low latency, incredibly fast. So I think
that's the best answer to your question, that Hadoop is a tool that's really good at some
things. Aerospike -- there are new tools coming out that are really, really, really exciting.
And Aerospike is one of them. I'm also very excited about stream-based processing, which
I think we're going to start seeing more of. Which could be -- I don't know if you guys
are talking about some of this -- new products or things like that. That's what I think is
going to get really, really exciting. Audience member: [??] Mike N: No, no. Audience member:
One of the stacks [??]. [34:30] Mike Y: Well, I would like to add to this a little bit.
There are more and more companies you see at tradeshows like this, that are always on
the cutting edge, and they have the most sophisticated algorithms, the most amazing models ever. Hadoop
is just not good enough for them. Anything is not good enough for them, because they
process so much data, they have to do the next coolest thing. The truth of the matter
is that there are very few actual working models and intelligence in this advertising
world. A few things really work. If you're trying to solve a really complex problem,
beyond the capabilities of the standard stacks and proven technologies, I'm going to bet
you a hundred bucks you're probably not going to solve the right problem. [35:17] Mike N:
Can I add, and then I'm going to totally let you take over. I think what you just said
is exactly true. What happens is, doing something at low scale -- like what you hear from all
the CTOs, we're saying, scale, scale, scale, scale, scale kills -- because it's very easy
to build an online advertising product at low scale. It's very easy to build a super
snazzy, dynamic, creative, it's interactive, it talks to you, it uses your web cam, you
can get real-time reporting on the back end -- if you're only serving a couple thousand
ads a day, no problem. I've got an engineer who could turn that around for you in a month.
The problem is when you start doing it a million times a day, and a billion times a day, and
ten billion times a day, and 40 billion times a day. That's when all those features and
functionalities that you have in that really cool product break. And one of the problems
with innovation in all advertising is that people don't think about scale. People raise
VC money, build a prototype that does a lot of really cool stuff, they say they do it
at scale but it really doesn't do it at scale, then they hit scale, and then the *** breaks,
and then you have to rebuild everything. And there just aren't enough commercial tools
out there to make these problems go away. So you suddenly end up needing 40 engineers
to make all this actually work. [36:26] Pat: Use the right tool for the right job, as Mike
got to, and be willing to iterate and explore before you get into production. We have a
lot of complexity, I would say, in our system, but we've approached it as splitting that
problem up into as many discrete parts as possible. Again, echoing the sentiment, it's
got to be testable. It's got to be modular. If you want to stream it, you want to communicate,
use something like ZeroMQ. There's a lot of queuing out there. There are ways to communicate among
these different components, and that way you can test them. We are very, very diligent
about metrics, as Dag mentioned. If you write code in our system or for our system, that
thing better log and let the world know what the hell is going on with that component,
pretty much at every step of the way, if it's interrogated. Because if we can't -- again,
echoing the same sentiment -- if we can't look and see exactly what's happening inside
any of these components, we're screwed. And you're flying blind. And then we can start
testing them, we can run through any regressions we need to, and we can see where the bottleneck
is going to be. You'll never catch all of them; sometimes you don't catch any of them.
Hey, you know what? We never thought someone was going to do twenty-five placements on
one page with 50 models. Sorry. Okay, we didn't. Bad on us. But you can catch a lot of that
stuff, or at least you can pinpoint it. Otherwise, again, I don't see how anyone would even want
to go to work in the morning without knowing how everything's operating. [38:09] Brian:
So I'm going to add to your question, you said something that's non-standard about scaling.
One thing I'm happy we did at Aerospike early on, is that most cluster databases -- for
example, Oracle RAC and many other of the support-based systems -- charge you per node.
And what happens then is, every single time, the ops guys, who have a feeling for
the number of nodes they want in order to have the resiliency they want, get crowded
out by business guys and it ends up being some long, involved conversation about, "Well,
do I really need a four-node license," "Why do I have to buy a six-node license," all
that stuff. So one thing we did to make all these guys more successful, and the whole
product more robust, was our business model. Plenty of technology, and all of the logging
and stuff like that, but I wanted to say, "Hey look, let's have your ops guys really
figure out what they want in terms of reliability and the number of copies of the data they hold; we'll
decouple that from the license terms, and as you start iterating, seeing your load go
up, and you need to add more servers -- great. You're not calling us. We want you to do that, and
feel comfortable with the amount of hardware you have without having to start thinking
about license terms. Because then people's heads explode. "Where's my budget? I didn't
ask for enough budget. I have to justify more." Stuff like that. So think about the impact
of your business model with your scalability. [39:32] Srini/Moderator: Is there any other
question from the audience? Okay. Audience member: [inaudible 00:39:35-00:40:50] When
you scale, things break [?] [40:51] Dag: That's a hard question. Of course it is hard to design
scalable systems from scratch. First of all, I think it's a mistake to always, always ask
"how do I scale this?" up front. If you're AppNexus, you need to scale to a certain level. If
you're us, we're not at that scale, but we're still at a high scale. But if everything you
build is always constrained by scale up front, then you will lose momentum in your innovation.
It's okay to build prototypes that don't scale so well, and if they actually turn out to
have value to your business, then you can go back to the drawing board and see how you
can make them more scalable. There are a lot of things that are easy to scale. And then
there are some things that are really hard to scale. Doing everything right from the
beginning, unless you've built an exactly identical system before, I think is impossible.
Engineers who have built similar systems, or who are interested in staying up to date and
know about the different tools -- that's one of the things we look for. NoSQL is sort
of the umbrella term for everything that is not a relational database, pretty much. It is
a lot of things, if we talk about just storage. Finding the right tools and thinking about
that up front can be useful, but it can be retrofitted too, to some degree. I mean, Twitter,
when they moved off their Ruby stack onto the JVM -- that's gotten a little bit of attention
in the last couple of days because of the election, and they credited that
switch as the reason they were able to cope with the spike in traffic
a couple of days ago. At any rate, the point is that if they had started out thinking about
services and decoupling and building the system the way that they do now, they would probably
have never gotten to the market at all. I agree with what Mike said, about it's easy
to build something that's really fancy and works with a thousand users or a million impressions
a day, maybe, but if you go into building a new product only thinking about what's
going to happen when I have a gazillion users, then you're not going to make a lot of innovation,
basically. Audience member: So in a way, I guess, it's kind of like: if you're doing well
and building scale, most likely something's going
to break anyway. [43:21]Pat: Be prepared to make a lot of mistakes. Don't be afraid of
that, that's for sure. Audience member: Anybody here not making any mistakes? Mike Y: Yeah,
sure. But I think also the question was, how do you teach developers to develop at scale?
Audience member: How do you get yourself to what you do, [??]. And one thing that I like
is, I actually like to do very well. The problem is, I don't think I [inaudible 00:43:53].
So... [43:54] Mike Y: I think that the answer is, you have to know, really, a few basic,
basic principles. And if you follow these principles, you can reach decent scale. And
these principles are: you don't make calls to SQL databases in real time. You don't build
synchronously coupled components. You have to have metrics and you have to have instrumentation
so you can see what's going on. There are certain anti-patterns that you don't do. Okay?
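Mike Y's first principle -- keep synchronous SQL calls off the real-time path -- can be sketched in a few lines. This is an illustrative Python sketch, not any panelist's actual system: the request path reads only an in-process cache, and a refresh step (which in production would run asynchronously, on a timer or a change feed) is the only code that touches the slow store, here faked as a plain dict.

```python
# Stand-in for a SQL database; the hot path must never query this directly.
slow_store = {"campaign:1": {"bid": 0.50}, "campaign:2": {"bid": 1.25}}

cache = {}  # in-process copy read by the request path

def refresh_cache():
    """Runs off the hot path (e.g. on a timer); the only reader of slow_store."""
    cache.update({k: dict(v) for k, v in slow_store.items()})

def handle_bid_request(campaign_id):
    """Hot path: a dictionary lookup only, no synchronous database call."""
    record = cache.get(campaign_id)
    return record["bid"] if record else None  # missing data degrades, never blocks

refresh_cache()
assert handle_bid_request("campaign:1") == 0.5
```

The point is the shape, not the dict: the hot path never waits on the database, so a slow or failing database degrades data freshness instead of request latency.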
And you will encounter each of them. If you're doing this for the first time ever, you'll
have a lot of failures. But you learn. Srini/Moderator: Sorry to interrupt here, but this discussion
is complex, as you are learning. Every one of us here has learned from our experience
as to how to deal with this. It is possible, I can guarantee you. But it is extremely hard.
So I think we should take some of this offline, unless -- Okay, is there any other question
form the audience? I want to give the audience priority before I have a whole list here.
Okay. [45:10] The next question, I want to ask a specific question to the panel. Can
you talk about an actual problem that you had to solve over the last year, in this real-time
computing space? And what product or technology did you use -- it could be something like a key value
store, it could be your own home-grown thing, but it must be in production. Let's start
with Pat. Pat: Does it have to be Aerospike? Srini/Moderator: No, of course not. Pat: We
had to solve a problem of cross-data center replication, and even when we have Aerospike
now it's not quite what we need, but even before we had it we had the same problem,
or a very similar problem, that Mike N. described before. That is, we could step on a record,
we need to basically journal our changes. If something about a user changes, that needs
to be journaled, that needs to be shipped across to another data center, and applied.
We can't just do a wholesale replace of our objects across all of our data centers. So
that is one case where we used a home-grown solution, again using ZeroMQ, which is a
very simple lockless queuing system and we were able to journal our transactions and
ship them across to all of our data centers, update them and keep them I would say eventually
consistent without any major headaches. There are other queuing products out there that
are commercial or open source, [inaudible 00:47:00] I would call more of a tool kit
or a framework. But again, you need to understand what your replication needs and requirements
are before you make any of those decisions. If it's okay to blow away an object on the
other side because it's out of date, then you can just fire and forget, as they say.
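Pat's journaling approach can be illustrated with a toy sketch. This is a guess at the shape, not Tapad's code: a plain Python list stands in for the ZeroMQ pipe between data centers, and each update journals the changed field (the delta) rather than replacing the whole object, so writers in different data centers don't step on each other's records.

```python
# Local and remote copies of a user record, one per data center.
local = {"user:42": {"segments": ["auto"], "impressions": 3}}
remote = {"user:42": {"segments": ["auto"], "impressions": 3}}

journal = []  # stand-in for the ZeroMQ pipe shipping changes between data centers

def apply_change(store, key, field, value):
    store.setdefault(key, {})[field] = value

def local_update(key, field, value):
    apply_change(local, key, field, value)
    journal.append((key, field, value))  # journal the delta, not the whole object

def replicate():
    """Drain the journal and replay each delta, in order, on the remote side."""
    while journal:
        apply_change(remote, *journal.pop(0))

local_update("user:42", "impressions", 4)
local_update("user:42", "segments", ["auto", "travel"])
replicate()
assert local == remote  # eventually consistent once the journal drains
```

A wholesale object replace would instead overwrite any field the remote side had changed in the meantime, which is exactly the "stepping on a record" problem described above.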
But that's probably the latest thing we've solved. [47:23] Mike Y: Well, a very significant
problem that we've tried to solve -- we haven't solved it yet -- is how do you combine all
this real-time data with analytical data? So we have this real-time data stored in Aerospike.
It's fantastic for really fast key value lookups. And it works. But then, imagine that you would
like to have access to all the same data and look at this not one user at a time, not one
object at a time, but find patterns and cross-dissect it. Make it available to all your dashboards
and UI tools and algorithms and look at this data set as a whole, not as each individual
key value. And keep all of this data in synch. That's a very difficult problem. So far, we've
found some hacked-up solutions to this, basically. Back the whole thing up, import it into
a more friendly queryable system -- be it Hadoop, be it Vertica or something like this -- join
it with the rest of your data, do operations on it, because you cannot do that in
a real-time data engine. Srini/Moderator: We're getting close to the end of our time,
so let's keep the answers brief. Dag? [48:44] Dag: I would say that making the data feeds
into actual feeds that are consumable not just by batch inserts into databases or by sorting
into file systems, but actually making them something you can tap into, so you can easily
extend your system into new types of analytical processes. So if you want to try, as Mike
mentioned, stream processing, which is something that we've started tapping into over the last
few months actually, how do you plug something into a system if it's already based on shipping
log files, or if it's something that is happening internally in machines somewhere. You want
to kind of democratize your data by making it accessible to new consumers of that data.
There are a few interesting big data queuing systems coming up now, we opted for one of
those. We run with Kafka now. That was a big shift for us. It also made our system more
resilient; the asynchrony is easier to handle, and we can now more easily handle failures in our data
stores and so on if that happens. That's the problem that we've been solving recently.
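The "tap into the feed" idea Dag describes is essentially what a Kafka-style log gives you. As a hedged sketch (a Python list plays the role of the append-only log; Kafka itself adds partitioning, durability, and retention), each consumer tracks its own offset, so a new analytical process can be plugged in later without touching the producers:

```python
log = []  # append-only event feed, the Kafka-like abstraction

def publish(event):
    log.append(event)

class Consumer:
    """Each consumer keeps its own offset, so new analytical processes
    can be plugged in later without changing the producers."""
    def __init__(self):
        self.offset = 0

    def poll(self):
        events = log[self.offset:]
        self.offset = len(log)
        return events

publish({"type": "impression", "user": "u1"})
publish({"type": "click", "user": "u1"})

batch_loader = Consumer()    # e.g. feeds the warehouse
stream_counter = Consumer()  # e.g. a newly added stream-processing job

assert len(batch_loader.poll()) == 2
publish({"type": "impression", "user": "u2"})
assert len(stream_counter.poll()) == 3  # a new consumer replays the full feed
assert len(batch_loader.poll()) == 1    # only the events since its last poll
```

Contrast this with shipping log files: there, adding a consumer means changing every producer's shipping logic, whereas here a consumer just starts reading from the log.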
[50:00] Mike N: That's funny. I love your answer. So we did two things recently, changing
how we stream data. We wrote our own data streaming a couple of years ago. Every impression
generates 10 to 15 log records, we've got 6 million log records a second or so that
we're dealing with. So we built a streaming infrastructure for that. And we added the
ability to start splicing data. We can now take our stream of data, we can splice it,
and we actually now have started streaming data to multiple places. It goes to Hadoop,
our standard hammer, to do aggregation. We've now built a very highly optimized -- it's
not quite streaming, but every two minutes we load into an HBase infrastructure where
we keep an offline copy of our key value data, so our guys can view every data record that
we have in Aerospike; we're also re-replicating inside HBase, which we can now use to do offline
attribute conversion, all sorts of really, really, really exciting offline stuff. We're
going to get to the point where it can do true, cross-channel attribution for any one
of our partners. And the win has really been doing stream-based processing, or stream-based
logging, and especially being able to splice that data into different places. The third thing,
we have a prototype a guy just put together with VoltDB, where for several of our clients
they stream the data into VoltDB in real time, and they have a prototype of real-time
reporting, which I actually don't know how useful it is, because in general you need
several hours of data before the data gets interesting. It's really cool and it actually works. What's
interesting is that there's just no open source or commercial tools yet that do any of this.
So this is a lot of the challenge we face, that you have to build this stuff from scratch.
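The splicing Mike describes -- one firehose, routed to several destinations -- can be sketched as a list of predicate/sink pairs, where one record may match several sinks. This is purely illustrative (the sink names and the "acme" client are made up; in production the sinks would be Hadoop, HBase, and VoltDB loaders rather than Python lists):

```python
# Sinks are (predicate, destination) pairs; one record can match several.
hadoop, hbase, voltdb = [], [], []
splices = [
    (lambda r: True, hadoop),                      # everything goes to aggregation
    (lambda r: "user" in r, hbase),                # keyed records for offline lookup
    (lambda r: r.get("client") == "acme", voltdb)  # one client's real-time feed
]

def stream(record):
    """Fan one log record out to every sink whose predicate matches."""
    for matches, sink in splices:
        if matches(record):
            sink.append(record)

stream({"user": "u1", "client": "acme", "event": "imp"})
stream({"event": "heartbeat"})

assert len(hadoop) == 2 and len(hbase) == 1 and len(voltdb) == 1
```

Adding a new destination is then just appending another pair to `splices`; the producers of the stream never change.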
And what's funny, five years from now, I'm sure there will be an open source tool. It's
a fantastic commercial opportunity for somebody, that'll just make it easier for the next guy.
[51:54] Pat: And I'll just add to Mike, we use our replication journals to feed a database
as you are talking about, Mike. That's how we do it. So when someone's state changes,
if you will, they saw an impression, they moved segments, since we journaled that to
ship it across data centers, we can also ship that to a Vertica database or something like
that, and get an up-to-date view of the customer. Srini/Moderator: Okay, we have a couple of
minutes left, so this is the last question from the audience. Audience member: Mike,
do you have any.. [52:21] Mike N: No, because it's, for us, very proprietary, in the sense that
when it becomes commercially available, sure, it might be good for us. At the moment, it
gives us a huge advantage that it's a generic infrastructure, any developer at AppNexus
who wants to get data from any of our data centers, from any server, into one location
just types a config, streams some data, and it magically appears in whatever data source
they want. It really gives our developers just a fantastic set of tools to run. I'm
not sure we just want to give that to everyone else just yet. We are working on another couple
of open source projects, around dev ops systems and continuous deployment, where we feel
we've benefited a ton from the open source community, so we're pushing back and sharing
some of the things we've built on top of Puppet and things like that. [53:13] Srini/Moderator:
Okay. Let me ask a final question, it's not as technical. Who's the one person who has
unexpectedly helped in your career and your business in the last couple of years? It could
be your mother, it could be anybody. But it's somebody who has affected the actual business,
and career. Dag? Dag: Well, thank you. Srini/Moderator: I have an answer from me, if you want to ask
me the question. Dag: Can we hear your answer first? [53:40] Moderator: Well, it's an advisor
for our company, an IBM fellow; I worked for him 25 years ago. And then he joined as an
advisor, and then he's been so key. His name is Don Haderle, he is known as the father
of DB2 actually. And he's a completely relational database person. We're doing NoSQL, new database.
But he's kind of taken us through, taught us, Doug, Brian and I, what exactly we were
doing when even we didn't understand what it was. And, has been a great help in explaining
this to other people, including investors and customers -- Fortune 500 customers and
so on. That was completely unexpected, what happened. That's the kind of thing. [54:28]
Mike N: I don't have one person but I have one group, which has been the New York CTO
Club. I don't know if anyone else here is a member of this club, but it's just an absolutely
fantastic group to be part of. Sounds like a recruiting pool, it turns out. But really,
I've been very lucky to have access to such a fantastic group of people, who have really
helped me be successful as a CTO, and have helped AppNexus scale, both by just advising
and helping, also, hiring. Which is good. [55:20] Mike Y: Well, so it was a very unexpected
question, so my answer to this question will be someone who's completely non-technical.
That's the CEO of our company, Mr. James Hill. He really taught me how to simplify and crystallize
things. He's very good at asking questions. What is it? Why are we doing this? How are
we doing this? Explaining things to someone who is not in advertising is an extremely interesting way of trying to cut to the core of what you're
doing as a company. Because the stuff that we deal with in the ad tech space, all the
buzzwords that we have, all the fancy words to your point about the models, about predictions
and big data and all that stuff -- if you can't cut through all that and explain it to
someone you meet at a random party, then you -- well, it's a very useful exercise, I would
say. Pat: I don't have one person either, I would -- it's going to sound corny, but I would
say my team, who's helped and who's built all this stuff, and who operate everything
every day. These are the guys, as Mike has seen, who drove to Pittsburgh. Sometimes you
have to put in a heck of a lot of time, and I don't care how bright of an idea you have,
how smart you are, how great you are, if you don't have a good team who's willing to do
basically whatever it takes, you're not going to be successful, in my opinion. Srini/Moderator:
Okay. Well, Brian, did you want to...? Brian: I was very surprised by our investors, our VC
group. Usually as an entrepreneur you think, you know what? You say nice things to your
VCs, but you take the money and you say "Thank you very much" and you use the money -- that's
their primary contribution. And anything they give you on top of that is gravy. But our
first round investor lead, Joe Addiego of Alsop-Louie Partners, has actually been an
immense help to the company. I think one of the benefits is he's very new to the VC game,
so he's not as blasé. And he has a great operational background, and a great sales background.
So in terms of helping us through the thicket of hiring sales people, especially who can
be very persuasive in person -- you have to figure out who's good, back to your management
question. So I want to put a shout out to him. He's been a great help through this.
Srini/Moderator: Thank you. I think that brings us to the conclusion of this panel. I'd like
to thank each and every one of the panel members for making time out of their busy schedules.
They're all running 24 by 7 systems, their teams are running them, and if it's down for
five minutes they don't make money. They're here because they've actually figured out
how to solve this problem. So thank you very much, and also for a great experience here.
Thanks to the audience for the questions.