Jenny Donnelly: Now we're going to deep-dive a little bit with Reid into how, specifically, the YUI project does our testing. Welcome, Reid.
Reid Burke: Thank you.
[applause]
Hey guys, my name's Reid. I work on the YUI team at Yahoo. For the last year I've been
working on improving how we test YUI. It's been really great and I'm really looking forward
to sharing everything that we've done in the last year with you guys.
This isn't the first time I've talked about this. It was really great for me to talk last year about the whole purpose of why people test. If you didn't catch my talk because this is your first time at YUIConf, I'd encourage you to look at at least the slides from my last talk, where I go over the whole point of what testing fulfills and what role testing has when writing code.
But testing is useful. The point of my talk last year was basically that testing isn't your goal: your goal is to write code that does what it's supposed to do, so don't go crazy with writing tests. Still, it is useful. We do it, and everyone should be doing it. That last talk was all about how it doesn't really matter what you call it; what matters is that you do it in a way that helps you check for common mistakes. Again, you're not paid to write tests.
The really cool thing is that about a year ago this all became possible for us, with technologies that Sauce Labs, for example, offered, where open source projects could use their resources for free. We were really interested in that. It basically let us test on all the environments that we support in YUI. It's really neat that this is all possible now.
But to go over where we were a year ago: what we were doing to test YUI was not so great. We basically had a single CI job, which was simple but had problems. We didn't use Yeti, we were using something else. We didn't test Ajax, so things like IO had to be tested by hand. Browsers were tested one at a time, so in CI one browser had to run to completion before we moved on to the next one, then the next one. We only had unit tests, so no examples were being tested in an automated way, which was really painful. And only two browsers plus Node were being tested automatically.
What we wound up with was that before a release we would have to do this testing one way or another. How we did that was a day-long session: get everyone in the room and test the code. You can see how that's kind of incompatible with doing rapid releases. So those sessions had to go, and we did get rid of them, but in order to do that we had to find a way to test the 15 target environments, plus the stuff we test because it's going to be on the target environment list soon.
But there were a lot of problems in getting there. When you have one CI job, things are pretty easy. Another project I work on is Yeti. With Yeti, the only environment I have to test on is Node, so whenever something goes wrong I get one email that tells me something's broken, because there's only one CI job. And because all the testing is fast, I don't need to spread my testing across many different CI jobs.
So I get an email. This is kind of ideal: I get an email, I go look, and it's pretty clear what broke. That's how it works when you use Travis for pretty simple projects. But when you want automation for 15 different real browsers, and you have about 7,000 tests for every single one of those 15 browsers, that's over 100,000 test runs every single time you make a change. So we're getting beyond that model. If we had just one CI job we'd be waiting hours, and we definitely can't do that.
So we outgrew it. We had a whole bunch of other problems too. But even though we outgrew the simple model, we found a way to get past those problems, and that's what this whole talk is about: how we overcame these challenges even though we couldn't use that simple model.
So, introducing what we do now. Just with a CI slave we can test with PhantomJS and Node. PhantomJS isn't an environment we actually support, but it's fast, so right away we can see if there's a problem. We use Grover to do this, which is an open source project on our GitHub. We test in Node as well. With Selenium we test every single IE, even IE6, all the way up to IE10, plus Chrome and Firefox. And we also use Sauce Labs.
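As a rough illustration of that kind of fast smoke-test step (not from the talk; the test paths are hypothetical), Grover's CLI takes HTML test pages and runs them under PhantomJS:

```js
// Minimal sketch: run YUI-style HTML test pages under PhantomJS with
// Grover (github.com/yui/grover) as a fast first CI check.
// The glob below is hypothetical; point it at your own test pages.
var exec = require('child_process').exec;

exec('grover tests/unit/assets/*.html', function (err, stdout, stderr) {
    console.log(stdout);
    if (err) {
        console.error(stderr);
        process.exit(1); // fail the build step if any test page failed
    }
});
```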
Like I said before, Sauce Labs introduced this very generous plan where if you're an
open source project you can use their resources for free and they'll actually help you out.
We approached them and said that we're totally interested and we want to use them as much
as it makes sense, and they've been really awesome. We use them to test the things that we can't yet test any other way. It's actually really nice, and we're hoping we can use them more in the future. Right now we use them for Safari and iOS.
Also, if you guys are remotely interested in Sauce Labs I'd love to talk to you. We
actually have Sauce Labs stickers that are going to be on the registration table, and
there's a whole bunch right here and you can pick some up. They've also given us promo codes: if you don't have an open source project but you use YUI and want to test out what Sauce Labs does, they're giving people here two months free so you can evaluate it and check it out. That's all up here. You can come see me afterward.
We no longer use what was basically a pile of hacks; we use Yeti to test our code. What Yeti gives us is Ajax testing with built-in echoecho, another project by Dav Glass, so we can now test YUI's IO. We can launch browsers on Sauce and Selenium, and we can run many browsers at the same time. Instead of testing one browser at a time we can run 3 to 10 browsers, or however many you want, all at once. That's been really huge; it speeds up testing quite a bit. And we can test our examples.
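To make the Ajax-testing part concrete, here's a sketch of the kind of YUI unit test this enables. It assumes echoecho-style echo routes are served relative to the test page; the exact route is illustrative:

```js
// Sketch of a YUI unit test exercising Y.io against an echo endpoint.
// Assumes the test server (e.g. Yeti with echoecho) answers routes like
// "echo/status/200" relative to the test page; the route is illustrative.
YUI().use('io', 'test', function (Y) {
    var suite = new Y.Test.Suite('io');

    suite.add(new Y.Test.Case({
        name: 'GET echoes back a 200',
        'request should succeed': function () {
            var test = this;
            Y.io('echo/status/200', {
                on: {
                    success: function (id, response) {
                        test.resume(function () {
                            Y.Assert.areEqual(200, response.status);
                        });
                    },
                    failure: function () {
                        test.resume(function () {
                            Y.Assert.fail('request failed');
                        });
                    }
                }
            });
            this.wait(); // resume() is called from the async callback
        }
    }));

    Y.Test.Runner.add(suite);
    Y.Test.Runner.run();
});
```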
Where we are now: 11 out of 15 of our target environments are tested every time someone pushes to GitHub.com. Our CI system picks that up and runs the tests on those 11; the other 4 work with Yeti but aren't yet automated.
For example, you may have noticed in the last slide, where I was talking about what we automate, that Android isn't on there. That's because we've found it really difficult to get Android working reliably on CI. But we can still test it before release, and we no longer have to go through every example by hand: we can point Yeti at the examples and at the unit tests. So we no longer need day-long sessions; we just do this right before release and we're done. It's all automated, and it's awesome.
Most of our environments are tested on every single commit, so as soon as there's a problem we can see immediately what went wrong. In theory. In practice, about 3 or 4 months into the year, once we had all this, even though we were testing everything, it was really, really hard to find where the problems were happening. I'll share that as the talk goes on.
But the testing was fast. With Yeti we had 8 parallel instances of IE being tested and 3 parallel instances of Safari. On top of that we get parallel builds: not only does each build run something like 3 to 8 browsers, we run many builds at the same time. A build here is something like a Travis job or a Jenkins job; once you make a commit we have four or five of those running at once, each going through its tests in up to 8 browsers.
To recap, we test all of this: the large majority of our supported environments are automatically tested. The system packs 3 hours of testing into under 90 minutes, and that number keeps dropping as we tweak things and make it better. It's already pretty good: within 90 minutes we know, on the majority of our environments, whether something works or not.
But it gets really hairy. This is the "in theory" part, and it's really two things. One, even though we're running all these tests, even when there are no problems at all it's hard to get that information out: we have all these CI jobs, so how do we find out where a problem actually is? You can no longer just go to one page and find everything you need; it's now spread over dozens of pages.
The other problem is that we have a lot of moving parts. To give you an idea: we have 21 CI jobs, so 21 things in Jenkins, and we have 3 branches, so that's 63 jobs. Each of those runs, let's say, 3 browser VMs (it can be up to 8), so that's 189 VMs that could possibly be running at once. And to run those VMs you also need the actual build slave VM that drives the test, one per job. That's about 252 machines, which means 252 ways things can go wrong, if not more.
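Spelled out, that arithmetic is simply:

```js
// The moving-parts arithmetic from the talk, spelled out.
var jobs        = 21 * 3;   //  63: 21 CI jobs across 3 branches
var browserVMs  = jobs * 3; // 189: ~3 browser VMs per job (can be up to 8)
var buildSlaves = jobs;     //  63: one build slave VM driving each job
var total       = browserVMs + buildSlaves; // 252 machines that can fail
console.log(jobs, browserVMs, total);       // 63 189 252
```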
Testing is now really hard. Instead of the roughly 4 machines in our old setup, we now have hundreds. Things go wrong, things are flaky. We had too much going on at one time and way too many test results to make sense of it all.
The workflow for an engineer on our team was impossible. What they would get is an email saying something went wrong, there was a failure, you've got to check something out because everything is totally broken. But if you want to find out what the failure actually is, this is the information you get. Really bare, out-of-the-box stuff, and Jenkins isn't much better, if not completely opaque. No one could understand this.
What you care about is: did the changes I just made break the build? Yes, something broke, but you don't know what. This is what we were dealing with, and it's what most people in this room are suffering with. In Travis you just see that the build is broken; it doesn't even tell you which test. You don't know why it broke. Did npm go down? You don't know.
What we did is we tried to make it so that if someone had to take action on something we would call the build unstable. So the build's unstable now. That means there weren't any infrastructure problems: somehow those 250-odd moving parts all meshed together perfectly, but there's a test failure. That's great. This is something engineers can take action on.
Well, not really. They go to the build and have to click on the test results link, which takes them here. Okay. This page usually takes about 30 seconds to load; if you use Jenkins with anything more than one job you know what I mean. Then you click through again, and okay, now we finally see test results. This is terrible. No one should have to do this. Ideally it should say what failed right in the email.
But the other problem is that some of these failing tests are what Andrew talked about in the community talk earlier. Some of them are failing because this is the first time we've run them on more than two browsers: on more than IE and an ancient version of Firefox. Once we threw all this stuff together we noticed a ton of failures. If you're not testing on all these environments in an automated way right now, you're going to run into the same problem once you have more than just a few tests.
This is just one example. We have tests that sometimes pass and sometimes don't; we call those flaky. The thing is, if these fail, engineers don't need to take action, not right away at least, because that person's commit didn't cause the problem. The tests are just flaky. We address them and reduce the number of flaky tests as much as we can every release, but the fact is we have them.
The good news, though, is that over 99.5 per cent of our tests, I think it's 99.8-something, are not flaky. Out of the hundreds of thousands of test runs, only about 300 tests are flaky at all, and that number keeps getting lower. But the problem is that this tiny fraction of 1 per cent was causing hell for our developers, and that's totally not acceptable.
This problem is repeated across those 63 jobs. Each one of them can have a flaky test or some other kind of problem, and then it spews out emails. So what are you going to do? Well, at Yahoo we have a dashboard for this, which is awesome, so we get to use the dashboard and that's going to solve all our problems.
Well, no. We look at this dashboard. The question we're asking is: did my latest commit pass, did the last thing I pushed work? When you look at this it doesn't answer that question at all; it just makes you more confused.
What it's doing is just taking Jenkins jobs and showing their history, and that's useful, yes. We see more information, we know more. It's not spread over a dozen pages, and it's fast: you go to this page and it loads instantly. But it just doesn't work out. Looking at it, there's no indication that the last build that ran is actually running the latest code.
The other thing you don't know: if it's yellow, is that a flaky test or not? We don't know. If it's red, was that an infrastructure problem or a legitimate failure? We don't know. This is all really bad.
You get flaky infrastructure from all kinds of sources: Sauce, Jenkins, build slaves, Selenium, Yeti. It's just a question of what's going to fail first. You're going to have this problem. And tests are flaky for the reasons I talked about earlier: they work almost all the time, but it's the one time they don't that sucks away productivity.
The other problem we saw with that interface, and this is everywhere, in Travis, in Jenkins: our most popular build systems all organize everything around build numbers. And for the most part we just don't care about build numbers. They don't answer any question the developer cares about. The developer cares about: did my commit work, did my commit pass, is it functional?
When you look at all these different jobs, you see they all have different numbers. One job can be at build 576 while another job testing the very same commit has a completely different number. There's no way to look at this and tell whether those two builds are running the same code or not. And that's terrible.
When you have this many tests, once you start writing a ton of them, you notice that Jenkins becomes slower and slower and slower. It works fine, until you come to depend on it. You want instant answers and it doesn't give you that; it takes over 10 seconds to load a page, and that's pretty bad too.
Ultimately what happens is that nobody responds to build failures. When you get an email you don't know what to do, so you do nothing, and the problems keep going. Halfway through the year we had this problem and it wasn't getting any better. Even though we were doing all this testing and all this work, we saw no end in sight for actually improving the quality of our code and being able to measure it, because the information was all over the place. This had to change.
The solution we built for this is something called yo/tests. Internally at Yahoo you literally type in yo/tests and it takes you to this page. You saw it earlier, and I get to show it to you now. Basically it fixes the pain we had with three things.
First, it classifies flaky tests. Yo/tests understands when something's flaky. The first time a test fails it shows up as a legitimate failure, and when something that's been stable fails, someone has to stop everything and investigate. But if they see that someone only updated the readme and yet some test is failing, they just mark it as flaky and move on with their lives. Then they're off the hook, and the test goes into a list of flaky tests that we have to review. That list is at about 350 right now.
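A minimal sketch of this kind of classification, assuming a simple results model (yo/tests is internal, so this is illustrative rather than its actual code): a test that's been manually flagged, or that both passed and failed on the same commit, is treated as flaky and kept out of the failure alerts.

```js
// Illustrative sketch of flaky-test classification (not yo/tests itself).
// A test is considered flaky if it has been manually flagged, or if the
// same commit has seen it both pass and fail (same code, different result).
var flaggedFlaky = {}; // testName -> true, set when a human marks it flaky

function isFlaky(testName, resultsForCommit) {
    if (flaggedFlaky[testName]) { return true; }
    var sawPass = false, sawFail = false;
    resultsForCommit.forEach(function (r) {
        if (r.test === testName) {
            if (r.passed) { sawPass = true; } else { sawFail = true; }
        }
    });
    return sawPass && sawFail;
}

// Only stable failures should alert engineers; flaky ones go to a review list.
function actionableFailures(resultsForCommit) {
    return resultsForCommit.filter(function (r) {
        return !r.passed && !isFlaky(r.test, resultsForCommit);
    });
}
```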
We also don't highlight flaky test results at all. They're there; you can click through to the detailed page and see what failed. But when you're checking whether your commit is good to go, you don't need to know that something with a habit of breaking is failing again.
We don't alert developers when something they didn't cause, some bad test that's been around forever, happens to fail on their change. This prevents people from panicking over a bad test. It reduces false positives for changes made in a completely different component, and it's a total win.
We also hide flaky infrastructure. If something goes wrong that would normally cause a red build, the bad, failing build, no one cares. They only care whether the tests are working, not whether Sauce Labs is down or Selenium flaked out or whatever. It just doesn't matter to them. So we hide that, and that prevents panic from things that really should go to me, because I'm the person responsible for the infrastructure, not the entire team or the people making changes, like you guys.
The other thing is we don't show build numbers unless you go looking for them. Instead we focus on commits, the things you actually push to GitHub.com. As builds complete they're organized around commits, which is pretty big: one commit can have dozens of builds that ran it. That turns on its head what CI systems do today. Instead of everything being organized by builds, the interface is organized by commits, each of which has its many builds attached. So developers know exactly where their code broke.
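A rough sketch of that inversion, again illustrative rather than yo/tests code: key everything by commit SHA and attach builds to it, instead of keying by build number.

```js
// Illustrative data model: organize results by commit, not by build number.
// Each build that comes in from any CI job is attached to the commit it ran.
var commits = {}; // sha -> { sha, author, builds: [] }

function recordBuild(sha, author, build) {
    // build: { job: 'yui3-dev-3.x', number: 576, environment: 'IE 6',
    //          results: [...] } -- the build number is kept, just demoted.
    var commit = commits[sha] || (commits[sha] = {
        sha: sha,
        author: author,
        builds: []
    });
    commit.builds.push(build);
}

// "Did my commit pass?" becomes one lookup instead of a hunt across jobs.
function commitPassed(sha) {
    var commit = commits[sha];
    return !!commit && commit.builds.every(function (b) {
        return b.results.every(function (r) { return r.passed; });
    });
}
```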
Now when you get that email, instead of staring at it with no idea what to do, you go to yo/tests and you see this page. I'll give you a minute to look at it; I'll keep it up for a few seconds. The idea is that you see exactly what the engineers on our team care about right away, and the key is the use of highlighting to show that you need to take action on something. If something isn't highlighted on this page, it means you're good or it's not complete yet. When it's something we know you need to act on, it's highlighted.
Here we see that the last thing pushed to the dev-3x branch had a stable unit test failure, a test that shouldn't have failed. We can also see that it's actually been failing for a while, so someone needs to go in and mark it as flaky.
The other thing we see right away is that there are environments that haven't run yet. Like I said, we roll builds up into commits, and we do the same thing with test results, organizing them by environment. You can have IE6 running on different infrastructure, like Sauce Labs or your own company's Selenium instance. What we do in yo/tests is recognize that there might be slight variances in what IE6 identifies itself as, and group it all together under the same environment.
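A minimal sketch of that grouping, with hypothetical user-agent patterns (real user-agent handling is more involved): map the slightly different identifiers reported by each infrastructure onto one canonical environment name.

```js
// Illustrative environment normalization: different infrastructures report
// the "same" browser slightly differently; map them to one canonical name.
function normalizeEnvironment(reported) {
    var ua = reported.toLowerCase();
    // Hypothetical patterns; real user-agent parsing is more involved.
    if (/msie 6/.test(ua) || /internet explorer.*6/.test(ua)) {
        return 'IE 6';
    }
    if (/msie 10/.test(ua)) {
        return 'IE 10';
    }
    return reported; // fall through: keep the raw name
}

// Results from Sauce Labs and an in-house Selenium grid now group together:
normalizeEnvironment('MSIE 6.0; Windows NT 5.1');         // 'IE 6'
normalizeEnvironment('Internet Explorer 6 (Sauce Labs)'); // 'IE 6'
```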
So no matter what infrastructure is running our tests, we can see what's still missing because of some infrastructure problem. We do still have infrastructure problems; they've plagued us right up until this talk, because I've been working on the talk instead of fixing them. But we know exactly what's going on here. This is just tremendously better, and it gives people what they actually need to know.
If you click on, say, the first entry, it shows us what the commit was, who committed it, and their GitHub avatar. It shows us a list of the four tests that failed, what component they're in, their names, and which build they failed in. So you can click on the build and go right to the source of the problem.
The other thing we have here is the list of the many different builds that all contributed to the health report, the test report, for this commit. If you want to see what happened you can click on any of them. What's interesting is that you can run many different jobs for the same commit and they're all shown here. That's something a lot of systems don't do.
You can restart builds in Travis, too. But here, when we restart builds, we can notice things: we see that a test passed, then failed, then passed again. Because each run is on the same code, that can hint to our system that the test is flaky. So that's pretty interesting.
Like I said before, you can get to all the unstable test failures. We don't highlight them, so you have to scroll down, but eventually you see them. As we go and try to fix unstable tests, you can still see that they're there and that they're getting better.
What all this has done is help us ship more often and ship better-quality releases. Things that we would normally have missed in releases are caught by this system, and we've been using it since YUI 3.12.
What I'd like to show you really quickly, in the final few minutes I have, is how this helped the most recent release. The most recent release lived on GitHub under a release branch. Inside yo/tests you can just click on dev-3x, or, when it's release time, the 3.13.0 release branch appears above it. If you click on that, you can see exactly what we used a month ago when we released the last version.
Looking at the rows here, we had a problem right before the release where code that shouldn't have been checked in was checked in. What's really cool is we can come here and see exactly what we need to know at release time: whether we're good to go or not.
When it came down to it, for the final commit, which was just a package.json change, we could see that everything but Android 4 was tested and we had zero stable unit failures. We did have stable Selleck failures, but when we went through them we saw that these weren't things that would block the release. So that was really good; we had exactly what we needed to know. Android 4 didn't run because of a temporary Sauce Labs issue, so we had to test it by hand, but then we were good to go.
And here we see that way more environments were tested for that final release, so we had a lot more things running and working.
Briefly, to recap some things in this interface that I think are key takeaways for people who are building a similar tool or are into this kind of thing.
We highlight things when you need to care about them. When something's incomplete it's gray, but it shows up with a highlighted background if it's something that needs to be fixed, like an incomplete environment that could indicate a problem with the CI system. That's something I need to pay attention to. If we don't know something yet, because not everything has run, we can show a zero but gray it out to indicate it's incomplete. When something's wrong, we highlight it and make it red.
And then when we know that there's no more browser information that's going to come in
and we know we're good, it shows up as green and there's no highlight.
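The convention boils down to a small mapping from what we know to how it's shown; here's an illustrative sketch, not yo/tests code:

```js
// Illustrative mapping of the display convention described above.
// state: { complete: all results in, failures: stable failures seen,
//          infraProblem: an environment never reported in }
function display(state) {
    if (!state.complete) {
        // Still waiting on results: gray, highlighted only if it looks
        // like the CI system itself is stuck.
        return { color: 'gray', highlight: state.infraProblem };
    }
    if (state.failures > 0) {
        return { color: 'red', highlight: true }; // take action
    }
    return { color: 'green', highlight: false };  // good to go
}
```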
The question we're asking here is: did my commit pass? That's what people care about, watching their commit as it goes through the system. With this, we think we got it, and that's awesome. You go through, you can see that this one didn't pass, and then you go make your changes knowing exactly why.
Key takeaways. You want to prevent panic from bad tests. When you start testing more, you're going to have bad tests. You'll find that something works on 10 different environments, then you add your Windows Phone or whatever and it doesn't work. What are you going to do, drop everything and fix that? Probably not. But it's really good to collect that data.
One of the biggest benefits we've gotten from this system is in deciding what to focus on. Even though we're not focusing day to day on flaky tests, we're collecting that data, so we can see that a particular test is failing more often than all the other flaky tests. Out of those 300 there are a couple that fail way more often than the others, and those are what we need to focus on.
So you're collecting test data for things that are flaky rather than just setting some kind of ignore flag; you're just not panicking everyone with those bad tests.
You also want to prevent panic from bad infrastructure. Usually your engineers don't care about the infrastructure; one engineer will. That person should care, but not everyone else. And instead of focusing on build numbers, you should be focusing on what exactly broke: the commit that's breaking.
How do you do this? Classify flaky tests: your CI system should understand what's flaky and what's not as you ramp up your testing. Hide flaky tests and infrastructure failures from engineers, and focus on commits.
We're not done with this tool. It's really pretty new, only a couple of months old. A lot of people are interested in it, and I can't wait to make it available for this community to use directly. It's already helped us decide when to ship a YUI release, and we use it to test all the contributions that have been given to us. We've been using the new CI system for the last several months, and over the last couple of months we've been using this tool on top of it to help us ship better releases.
But we're not done. We would love to explore how we can share as much of this as possible
with this community, starting with this talk. Hopefully what you've learned is some of the
pitfalls of doing testing in a really big way and how you can avoid those pitfalls and
test better.
Thanks a lot. By the way, this is all made with YUI and Shifter and Jade and other fun
stuff, and you can find that all out when you view these slides at reid.in/everywhere.
My name's Reid and it's been really fun, so thanks guys.
[applause]
Jenny: Thank you. We have time for a question if anyone has one while we get set up for
our next speaker. Let me know. Okay, Reid's going to be around the conference if you want
to follow up with him directly.
Did you have something? Alright.
Audience member: I'm probably going to follow up later with Reid, but I have a quick question for him. How does Yeti fit into the whole multi-environment, cross-machine setup? Basically, is there one piece that kicks everything off and ties the unit tests together with Selenium? How does Yeti fit into the whole environment that you just talked about?
Reid: Yeah. Everything goes through Yeti at some point. We're basically using it in a pretty standard way. We have a Yeti hub; for Sauce Labs, for example, we have a Yeti hub that sits on the internet and our build system talks to that, and Sauce Labs goes through it to talk to our CI slave. With Yeti, just out of the box, you can say: connect to this ondemand.saucelabs.com address and start these three browsers. That's what we do. We start 3 Safari browsers or 3 iOS 6 browsers and it handles everything for us.
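For a flavor of what that looks like, here's a sketch of wiring Yeti into a build step. The hub URL and test paths are hypothetical, the flags should be checked against the Yeti docs at yeti.cx, and the Sauce Labs/WebDriver browser launching is configured through Yeti's own options, omitted here:

```js
// Sketch of driving Yeti from a Node CI step (flags hedged; see yeti.cx).
// A hub is started once somewhere reachable ("yeti --server" serves one on
// port 9000 by default); each CI job then points Yeti at that hub along
// with the HTML test pages to run in whatever browsers are connected.
var exec = require('child_process').exec;

exec('yeti --hub http://hub.example.com:9000 tests/unit/*.html',
    function (err, stdout) {
        console.log(stdout);       // per-browser pass/fail summary
        process.exit(err ? 1 : 0); // propagate failures to CI
    });
```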
Really quick: one thing I did recently is we now have some code that starts a new Yeti hub every time there's a build. Each build has its own copy of Yeti, which prevents some flakiness and makes some of our builds faster. We look forward to sharing that once it's working well.
Jenny: Tell people what Yeti is.
Reid: Yeah, so if you guys don't know what Yeti is: Yeti is our tool for testing YUI. And not just YUI; it also works with Mocha, Jasmine, QUnit, and one more I'm forgetting. It's open source and you can find out more at yeti.cx. It's pretty awesome. If you're testing JavaScript and want to know how to automate things, it not only works with our huge CI system, it works really well for small teams and on your own computer, so check it out.
[applause]