>>Tony Voellm: We are definitely in the final stretch of GTAC 2013. We have two more lightning
talks that I'm going to introduce for you. And then we're going to have an academic talk,
and then we're going to round out and finalize the day with security.
So definitely stick around. I have a couple more of these Droids underneath the podium
here. >>Tony Voellm: So with that, I'm going to
introduce Yvette Nameth and Brendan Dhein. And they're going to be talking to you about
continuous maps data testing. Here you go, Yvette.
>>Yvette Nameth: Hi. I'm Yvette Nameth, and this is Brendan Dhein. And we are both Google
testers on Google Maps. If you didn't see the video earlier, you can take a look, but
I was the person in the lovely video that Tony showed at the beginning of the day.
And this talk is going to give you a hint as to how we actually do the testing
that I was describing there.
So why are we doing this? Well, take a look. Something might be a little
wet in Mexico. Maybe a little global warming going on. Maybe the entire West Coast of Mexico
is entirely flooded. This is what can happen when a maps data bug
actually occurs. And this is not a software bug. This is the raw data
that we're using to build the map images. So let's talk about how maps get rendered.
Well, we have all this data that's in a large repository. It's coming in from all these
different feeds and creates that world data repository that you see on the left.
In the middle, we have a data processing pipeline, sometimes known as our rendering pipeline,
which generates images based on all of the different features that are in the world data.
So things that would be in the world data would be features such as locations, cities,
restaurants, roads, et cetera. Each one of these has an associated geometry which we
then need to create a style for, which would potentially be like a polygon for a park,
containing a fill color, a stroke, and a label. So if that raw data is crap,
obviously, the map coming out is going to kind of look like Mexico did. It was missing
a big chunk of land. We're at a testing conference, so what about
testing this data? We actually want to test every single piece independently. We can't
just test the end product. I am currently primarily focused on testing the end product.
But in order to get there, I had to first test the world data with Brendan.
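To make that world data a bit more concrete, here is a minimal sketch of what a single feature and the rendering style derived from it might look like. This is only an illustration in Python with hypothetical field names, not the actual Geo data schema:

    from dataclasses import dataclass

    @dataclass
    class Feature:
        # One hypothetical world-data feature, e.g. a park.
        feature_id: str
        feature_type: str               # "park", "road", "restaurant", ...
        name: str                       # the label that ends up on the map
        geometry: list                  # (lat, lng) vertices of its polygon

    @dataclass
    class Style:
        # The rendering style the pipeline would derive for that feature.
        fill_color: str
        stroke_color: str
        label_text: str

    park = Feature(
        feature_id="feature/park/123",
        feature_type="park",
        name="Central Park",
        geometry=[(40.7644, -73.9735), (40.8003, -73.9580),
                  (40.7968, -73.9493), (40.7644, -73.9735)],
    )
    park_style = Style(fill_color="#c8e6c9", stroke_color="#388e3c",
                       label_text=park.name)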
>>Brendan Dhein: Cool. So, moving on, there are some patterns and antipatterns we wanted
to share with you. I'm not sure how many of you have seen that
photo of the donkey lying down in the sand. This was actually taken from a Street View
car a while back and made its way around CNN. Just to set the record straight, the donkey
is still alive. It is not dead. It actually was set up there, and it's just taking a nap.
How does this translate into data testing? Well, if you notice, you can test a lot of
things in this picture. You can test whether the donkey is dead or alive. That would be
a big feature. Or you can test if every single grain of sand in the picture still exists.
And you can see this sort of with data testing as well.
Let's say we have New York City. Do you want to test the exact geometry of every ZIP code
in New York City? Do people care if the geometry shifts a bit one way or another? Or let's say, do
you want to test the name of every subway station?
You want to make sure that important features exist, like New York's still labeled. Granted,
for user experience, some things don't really matter. And part of the trick in doing data
testing is determining what matters and what doesn't.
For a naive approach, you could easily do a simple diff test. This sounds great initially,
and you're going to be, like, oh, yeah, I'm going to test every feature.
That won't work. And it doesn't take too much to realize it. But just to throw some numbers
out, let's say you had 100 million different geographic features in your data dump. What
happens if you had a 1% failure rate? That's a million features to look at. Would you be
able to triage that in a reasonable amount of time?
And by "reasonable," we mean in time to make a launch. Things that you really care about.
So moving onward, that's all great, but how
might you come up with a reasonable solution? And this picture here, I picked it out, actually,
from my Christmas vacation to Barcelona. It's the Sagrada Familia. It's been under construction
for ages now. It will probably be under construction until we're all dead and beyond there. You
can still see the cranes there. And they're actively working.
And that's sort of like data testing. You need to start and you need to have some sort
of a plan, but you need to keep building up your corpus as time goes on.
You want to keep up and go through and test and test and test, but you want to do it carefully.
You don't want to do a simple diff test. You want to look at things like statistical analyses.
I mentioned that, like, if every subway station changed in New York City. That's something
you might care about. I mean, there was actually an episode on Google Maps where, for a very
brief period of time, we might have displayed additional subway stations in New York City,
like, by a factor of two and chose some really awkward names for the subways.
[ Laughter ] >>Brendan Dhein: For those of you who don't
know, apparently the New York City subway system was actually made up of two or three
different systems. And when you get data from your third-party provider, they may use the
historic names. So we were displaying some IRT and BMT names. And when you had stations
that, in our view of the world, sort of joined together, well, we stopped doing that.
And that's sort of the type of data you want to avoid.
And you can do that with some structured diff tests that test exactly what you want.
You want to ask, do my names still make sense for these critical features? You would want
to, say, test New York City exists, Washington, D.C., exists. And that's sort of a basic smoke
acceptance test. What you want to do, though, is, assuming
that those pass, you also want to look to see if the state of the world has changed
or not. You want to look, and you want to see, oh, have my oceans changed with -- or
within a given region, has the makeup of the region changed?
Let's say you're doing a data dump and you're processing it for export. Would you find it
odd if, say, a given city doubled its number of aquariums? Might be a bit concerning.
And along those lines, let's say you had a city that suddenly had a lot more airport area.
Maybe an entire airport went wrong and is now covering a city. It can happen.
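As a rough illustration of that kind of statistical check, and only a sketch rather than the actual pipeline, you could compare per-category feature counts for one region between the previous data dump and the new one, and flag any category whose count swings past a threshold. The feature data and the threshold below are made up:

    from collections import Counter

    def suspicious_changes(old_features, new_features, max_ratio=1.5):
        # old_features / new_features: lists of (feature_type, name) tuples for
        # one region, from the previous and the current data dump.
        old_counts = Counter(ftype for ftype, _ in old_features)
        new_counts = Counter(ftype for ftype, _ in new_features)
        flagged = {}
        for ftype in set(old_counts) | set(new_counts):
            old, new = old_counts[ftype], new_counts[ftype]
            # A category that vanished, appeared out of nowhere, or swung past
            # the ratio (say, aquariums doubling) deserves a human look.
            if old == 0 or new == 0 or new / old > max_ratio or old / new > max_ratio:
                flagged[ftype] = (old, new)
        return flagged

    # Made-up example: one extra aquarium shows up in a region's new dump.
    old = [("aquarium", "A"), ("park", "P1"), ("park", "P2")]
    new = [("aquarium", "A"), ("aquarium", "B"), ("park", "P1"), ("park", "P2")]
    print(suspicious_changes(old, new))     # {'aquarium': (1, 2)}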
So if you try to build up tests like this, once you build a statistical corpus and divide
the world into regions, you can make a lot of progress.
What you can do is you can actually take each geographical region, say, a Metro area, or
something you know you care about, something that's small enough to be understandable
by a human but not so large that you're completely overwhelmed by the details of what's
changed, and use that as a basis for analysis. Then you can think, well, if I had a manual
tester, what would my manual tester be doing? My manual tester would probably look through
the area and say, well, do all my roads still exist? Do I still have the lake that's in
the middle of the city? Do I still have parks? These are tests that you typically see a manual
tester do. Now, we have computers, and this is GTAC,
we're trying to automate this. Could we perhaps have a system that goes through and looks
and listens and just tries to interpret what's changed? Has there been a dropout? Has there
been a suspicious gain? Have you gained an entire new set of features?
Now, assuming you can do this on one region and you get parsable, understandable, and
human-scale results, you could probably speed it up even. I mean, we're trying to do this
fast. Each region can actually be executed in parallel.
Now you have a testing architecture where you have locally specific outputs that actually
have quantifiable and reasonable results that you can understand and interpret before you
need to launch. And that's sort of the sweet spot you want to be in for data testing.
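Here is a hedged sketch of what that architecture could look like, using a per-region check like the one sketched earlier and Python's multiprocessing as a stand-in for the real parallel infrastructure. The region names and the load_features() helper are hypothetical, not a real API:

    from multiprocessing import Pool

    def load_features(region_name, dump):
        # Hypothetical helper: a real system would read the region's slice of
        # the world data for the given dump; here it just returns nothing.
        return []

    def analyze_region(region_name):
        # Run the checks for one region and return a small, human-scale report.
        # suspicious_changes() is the per-region check sketched earlier.
        old_features = load_features(region_name, dump="previous")
        new_features = load_features(region_name, dump="current")
        return region_name, suspicious_changes(old_features, new_features)

    def analyze_world(regions, workers=16):
        # Regions are independent, so their checks can run in parallel, and
        # each result stays small enough to triage before launch.
        with Pool(workers) as pool:
            return dict(pool.map(analyze_region, regions))

    # reports = analyze_world(["new-york-metro", "sf-bay-area", "greater-melbourne"])
    # for region, flagged in reports.items():
    #     if flagged:
    #         print(region, flagged)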
And on that note, we move towards frequency. >>Yvette Nameth: So now that you know that
we're going to be doing MapReduces over all these different areas, how often should we
do this? We've got this very, very large repository
that takes a long time to make. And by "a long time," I don't mean minutes. Everything
in the Geo data world, whether it's that rendering pipeline, which can take up to a day, or data
processing prior to that pipeline, takes -- we're talking hours. Some of these processes take
days to do all the correlation. So how often do we want to do this?
Well, we push map tiles on about a monthly basis. But if you're trying to test the data
on a monthly basis, you've got a month's worth of changes that come in from all the different
changes that are happening around us in the geographic world. These are coming in all
the time. We have this product called MapMaker that's pushing data changes to us all the
time. We're getting new feeds updated all the time with batch changes.
So that would be like trying to drink from a geyser. It just would really suck. It would
kick you back on your ***, you'd be bloody, and you'd be picking yourself up and fighting
fires, trying to get some version of the world that was actually a legitimate representation.
So we're not going to do that. And, you know, we're all overachieving testers.
So we think we should test every change. And that actually really, really sucks, too. Because
like I described, things are changing every millisecond in the data, and they might be
very small. And doing this parceling out of testing into MapReduces actually takes a long
time, too, unfortunately. So every change is kind of like drinking from
a waterfall. It's just this never-ending barrage of water. So we're not going to do that.
We kind of came up with this compromise that said daily. Daily gives us a really good signal.
It's something that we can look at and say that on this day, we see that all these parks
disappeared. What happened on that day? Was there a batch change to a feed that might
have parks? Was there, you know, some public data changed about them? Did some user come
in and decide to just delete all the parks in Washington, D.C.?
We don't know. But now we at least have a starting point that sort of gives us a little
bit of a temporal location. So we finally have our happy little water fountain to drink
from. And that's what I'll leave you with.
[ Applause ] >>Tony Voellm: Great.
Great. Thank you Brendan and Yvette. So we have Q&A so you can line up and ask
questions if you like. We can take one live. And, like, in five seconds, I will have the
moderator in front of me. My takeaway? Wow, you get to go to Hawaii
a lot. That's -- >>Yvette Nameth: That's Australia. This is
actually outside of Melbourne. There are baby penguins and little penguins that you can
find under your car. >>Tony Voellm: I think there are two questions
over here. But they're deciding who's going to go first.
Please. >>Yvette Nameth: Don't be afraid.
>>> I'm just curious about -- I assume users will sometimes find errors, since you can't
check everything? But how often would you say that your tests
find errors in the data as opposed to the users finding errors?
>>Yvette Nameth: I think the scale of the error that a user finds versus the scale of
an error that our test finds, we're talking massively different.
Like Brendan said, we're checking for the really big things, you know, the things that
have to be right. Like, the Eiffel Tower has to exist on the map, because a majority of
users would notice that. We're not testing that, like, your parcel of land, like, that
little outline, is exactly correct. Which is what, like, you know, my aunt complains
to me about when I tell her I work on Google Maps. But how am I ever going to know this?
So I would say that we catch a majority of the large ones. I don't know if I would give
a back of envelope percentage for that. And I would say we catch very few of the small
ones unless they are systemic, like every parcel of land is misplaced.
>>> Okay. >>Tony Voellm: Great. Next question up over
here to the right, please. >>> Hi, my name is Igor. I work for Nokia
maps. And my question is, do you compile data? If
yes, do you test it as both raw and compiled data? >>Brendan Dhein: That's an interesting question.
And let's think about how to answer this correctly. So in terms of do we test our third-party
providers coming in and the actual data package that goes out to maps?
>>> Exactly. >>Brendan Dhein: Yes.
>>Yvette Nameth: So, yeah, we test both and can't really say much more than that.
>>> Thank you. >>Yvette Nameth: I really like my NDA and
my job. [ Laughter ]
>>Brendan Dhein: Exactly. >>Tony Voellm: Yeah, I always like these questions.
Great. We have time for just one more. So please.
>>> I'm Dylan (phonetic). I work for Google. High-level question. You talked about data,
and before now, we have mostly been talking about code. Does the situation come up where
you have a data push that's totally valid that suddenly sets off a code bug that maybe
has been in production for months, you know, and other mitigation -- The question is, what's
the mitigation strategy for that? >>Brendan Dhein: Depending upon how bad the
code bug you found is, very fast rollbacks are essentially key to doing a data push.
We want to be able, if we do have a problem in production, to turn it off. But
also canarying your data and just following good release hygiene and trying to simulate
any type of failure mode before it actually hits the user.
>>Tony Voellm: Great. >>Yvette Nameth: And that is the second most
common type of bug after the raw data bug. So....
>>Tony Voellm: Great. And with that, thank you, Yvette. Thank you,
Brendan. [ Applause ]