Jenny Donnelly: Now we're going to deep-dive a little bit with Reid into how, specifically, the YUI project does our testing. Welcome, Reid.
Reid Burke: Thank you.
[applause]
Hey guys, my name's Reid. I work on the YUI team at Yahoo. For the last year I've been
working on improving how we test YUI. It's been really great and I'm really looking forward
to sharing everything that we've done in the last year with you guys.
This isn't the first time I've talked about this. It was really great for me to talk last year about the whole purpose of why people test. If you didn't catch my talk because this is your first time at YUIConf, I'd encourage you to look at at least the slides from my last talk, where I go over the whole point of what testing fulfills and what role testing has when writing code.
But testing is useful. The point of my talk last year was basically that testing isn't your goal: your goal is to write code that does what it's supposed to do, so don't go crazy with writing tests. Still, it is useful. We do it, and everyone should be doing it. That last talk was all about how it doesn't really matter what you call it; what matters is that you do it in a way that helps you check for common mistakes. Again, you're not paid to write tests.
The really cool thing is that about a year ago this all became possible for us, with technologies that Sauce Labs, for example, offered, where open source projects could use their resources for free. We were really interested in that. It basically let us test on all the environments that we support in YUI. It's really neat that this is all possible now.
But to go over where we were a year ago: what we were doing to test YUI was not so great. We basically had a single CI job, which was simple but had problems. We didn't use Yeti, we were using something else. We didn't test Ajax, so things like IO had to be tested by hand. Browsers were tested one at a time, so in CI one browser had to run to completion before we moved on to the next one, then the next one. We only had unit tests, so no examples were being tested in an automated way, which was really painful. And only two browsers plus Node were being tested automatically.
What we wound up with was that before a release we would have to do this testing one way or another. How we did that was a day-long session: get everyone in the room and test the code. You can see how that's kind of incompatible with doing rapid releases. So those sessions had to go, and we did get rid of them, but in order to do that we had to find a way to test the 15 target environments, plus the stuff we test because it's going to be on the target environment list soon.
But there were a lot of problems in getting there. When you have one CI job, things are pretty easy. Another project I work on is Yeti. With Yeti, the only environment I have to test on is Node, so whenever something goes wrong I get one email that tells me something's broken, because there's only one CI job. And because all the testing is fast, I don't need to spread my testing across many different CI jobs.
So I get an email. This is kind of ideal: I get an email, I go look, and it's pretty clear what broke. That's how it works when you use Travis for pretty simple projects. But when you want automation for 15 different real browsers, and you have about 7,000 tests for every single one of those 15 browsers, that's over 100,000 test runs every single time you make a change. So we're getting beyond that model. If we had just one CI job we'd be waiting hours, and we definitely can't do that.
So we outgrew it. We had a whole bunch of other problems too. But even though we outgrew the simple model, we found a way to get past those problems, and that's what this whole talk is about: how we overcame these challenges even though we couldn't use that simple model.
So, introducing what we do now. Just with a CI slave we can test with PhantomJS and Node. PhantomJS isn't an environment we actually support, but it's fast, so right away we can see if there's a problem. We use Grover to do this, which is an open source project on our GitHub. We test in Node as well. With Selenium we test every single IE, even IE6, all the way up to IE10, plus Chrome and Firefox. And we also use Sauce Labs.
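As a rough illustration of that kind of fast smoke-test step (not from the talk; the test paths are hypothetical), Grover's CLI takes HTML test pages and runs them under PhantomJS:

```js
// Minimal sketch: run YUI-style HTML test pages under PhantomJS with
// Grover (github.com/yui/grover) as a fast first CI check.
// The glob below is hypothetical; point it at your own test pages.
var exec = require('child_process').exec;

exec('grover tests/unit/assets/*.html', function (err, stdout, stderr) {
    console.log(stdout);
    if (err) {
        console.error(stderr);
        process.exit(1); // fail the build step if any test page failed
    }
});
```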
Like I said before, Sauce Labs introduced this very generous plan where if you're an
open source project you can use their resources for free and they'll actually help you out.
We approached them and said that we're totally interested and we want to use them as much
as it makes sense, and they've been really awesome. We use them to test the things that we can't yet test any other way. It's actually really nice, and we're hoping we can use them more in the future. Right now we use them for Safari and iOS.
Also, if you guys are remotely interested in Sauce Labs I'd love to talk to you. We
actually have Sauce Labs stickers that are going to be on the registration table, and
there's a whole bunch right here and you can pick some up. They've also given us promo codes: if you don't have an open source project but you use YUI and want to test out what Sauce Labs does, they're giving people here two months free so you can evaluate it and check it out. That's all up here. You can come see me afterward.
We no longer use what was basically a pile of hacks; we use Yeti to test our code. What Yeti gives us is Ajax testing with built-in echoecho, another project by Dav Glass, so we can now test YUI's IO. We can launch browsers on Sauce and Selenium, and we can run many browsers at the same time. Instead of testing one browser at a time we can run 3 to 10 browsers, or however many you want, all at once. That's been really huge; it speeds up testing quite a bit. And we can test our examples.
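To make the Ajax-testing part concrete, here's a sketch of the kind of YUI unit test this enables. It assumes echoecho-style echo routes are served relative to the test page; the exact route is illustrative:

```js
// Sketch of a YUI unit test exercising Y.io against an echo endpoint.
// Assumes the test server (e.g. Yeti with echoecho) answers routes like
// "echo/status/200" relative to the test page; the route is illustrative.
YUI().use('io', 'test', function (Y) {
    var suite = new Y.Test.Suite('io');

    suite.add(new Y.Test.Case({
        name: 'GET echoes back a 200',
        'request should succeed': function () {
            var test = this;
            Y.io('echo/status/200', {
                on: {
                    success: function (id, response) {
                        test.resume(function () {
                            Y.Assert.areEqual(200, response.status);
                        });
                    },
                    failure: function () {
                        test.resume(function () {
                            Y.Assert.fail('request failed');
                        });
                    }
                }
            });
            this.wait(); // resume() is called from the async callback
        }
    }));

    Y.Test.Runner.add(suite);
    Y.Test.Runner.run();
});
```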
Where we are now: 11 out of 15 of our target environments are tested every time someone pushes to GitHub.com. Our CI system picks that up and runs the tests on those 11; the other 4 work with Yeti but aren't yet automated.
For example, you may have noticed in the last slide, where I was talking about what we automate, that Android isn't on there. That's because we've found it really difficult to get Android working reliably on CI. But we can still test it before release, and we no longer have to go through every example by hand: we can point Yeti at the examples and at the unit tests. So we no longer need day-long sessions; we just do this right before release and we're done. It's all automated, and it's awesome.
Most of our environments are tested on every single commit, so as soon as there's a problem we can see immediately what went wrong. In theory. In practice, about 3 or 4 months into the year, once we had all this, even though we were testing everything, it was really, really hard to find where the problems were happening. I'll share that as the talk goes on.
But the testing was fast. With Yeti we had 8 parallel instances of IE being tested and 3 parallel instances of Safari. On top of that we get parallel builds: not only does each build run something like 3 to 8 browsers, we run many builds at the same time. A build here is something like a Travis job or a Jenkins job; once you make a commit we have four or five of those running at once, each going through its tests in up to 8 browsers.
To recap, we test all of this: the large majority of our supported environments are automatically tested. The system packs 3 hours of testing into under 90 minutes, and that number keeps dropping as we tweak things and make it better. It's already pretty good: within 90 minutes we know, on the majority of our environments, whether something works or not.
But it gets really hairy. This is the "in theory" part, and it's really two things. One, even though we're running all these tests, even when there are no problems at all it's hard to get that information out: we have all these CI jobs, so how do we find out where a problem actually is? You can no longer just go to one page and find everything you need; it's now spread over dozens of pages.
The other problem is that we have a lot of moving parts. To give you an idea: we have 21 CI jobs, so 21 things in Jenkins, and we have 3 branches, so that's 63 jobs. Each of those runs, let's say, 3 browser VMs (it can be up to 8), so that's 189 VMs that could possibly be running at once. And to run those VMs you also need the actual build slave VM that drives the test, one per job. That's about 252 machines, which means 252 ways things can go wrong, if not more.
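Spelled out, that arithmetic is simply:

```js
// The moving-parts arithmetic from the talk, spelled out.
var jobs        = 21 * 3;   //  63: 21 CI jobs across 3 branches
var browserVMs  = jobs * 3; // 189: ~3 browser VMs per job (can be up to 8)
var buildSlaves = jobs;     //  63: one build slave VM driving each job
var total       = browserVMs + buildSlaves; // 252 machines that can fail
console.log(jobs, browserVMs, total);       // 63 189 252
```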
Testing is now really hard. Instead of the roughly 4 machines in our old setup, we now have hundreds. Things go wrong, things are flaky. We had too much going on at one time and way too many test results to make sense of it all.
The workflow for an engineer on our team was impossible. What they would get is an email saying something went wrong, there was a failure, you've got to check something out because everything is totally broken. But if you want to find out what the failure actually is, this is the information you get. Really bare, out-of-the-box stuff, and Jenkins isn't much better, if not completely opaque. No one could understand this.
What you care about is: did the changes I just made break the build? Yes, something broke, but you don't know what. This is what we were dealing with, and it's what most people in this room are suffering with. In Travis you just see that the build is broken; it doesn't even tell you which test. You don't know why it broke. Did npm go down? You don't know.
What we did is we tried to make it so that if someone had to take action on something we would call the build unstable. So the build's unstable now. That means there weren't any infrastructure problems: somehow those 250-odd moving parts all meshed together perfectly, but there's a test failure. That's great. This is something engineers can take action on.
Well, not really. They go to the build and have to click on the test results link, which takes them here. Okay. This page usually takes about 30 seconds to load; if you use Jenkins with anything more than one job you know what I mean. Then you click through again, and okay, now we finally see test results. This is terrible. No one should have to do this. Ideally it should say what failed right in the email.
But the other problem is that some of these failing tests are what Andrew talked about in the community talk earlier. Some of them are failing because this is the first time we've run them on more than two browsers: on more than IE and an ancient version of Firefox. Once we threw all this stuff together we noticed a ton of failures. If you're not testing on all these environments in an automated way right now, you're going to run into the same problem once you have more than just a few tests.
This is just one example. We have tests that sometimes pass and sometimes don't; we call those flaky. The thing is, if these fail, engineers don't need to take action, not right away at least, because that person's commit didn't cause the problem. The tests are just flaky. We address them and reduce the number of flaky tests as much as we can every release, but the fact is we have them.
The good news, though, is that over 99.5 per cent of our tests, I think it's 99.8-something, are not flaky. Out of the hundreds of thousands of test runs, only about 300 tests are flaky at all, and that number keeps getting lower. But the problem is that this tiny fraction of 1 per cent was causing hell for our developers, and that's totally not acceptable.
This problem is repeated across those 63 jobs. Each one of them can have a flaky test or some other kind of problem, and then it spews out emails. So what are you going to do? Well, at Yahoo we have a dashboard for this, which is awesome, so we get to use the dashboard and that's going to solve all our problems.
Well, no. We look at this dashboard. The question we're asking is: did my latest commit pass, did the last thing I pushed work? When you look at this it doesn't answer that question at all; it just makes you more confused.
What it's doing is just taking Jenkins jobs and showing their history, and that's useful, yes. We see more information, we know more. It's not spread over a dozen pages, and it's fast: you go to this page and it loads instantly. But it just doesn't work out. Looking at it, there's no indication that the last build that ran is actually running the latest code.
The other thing you don't know: if it's yellow, is that a flaky test or not? We don't know. If it's red, was that an infrastructure problem or a legitimate failure? We don't know. This is all really bad.
You get flaky infrastructure from all kinds of sources: Sauce, Jenkins, build slaves, Selenium, Yeti. It's just a question of what's going to fail first. You're going to have this problem. And tests are flaky for the reasons I talked about earlier: they work almost all the time, but it's the one time they don't that sucks away productivity.
The other problem we saw with that interface, and this is everywhere, in Travis, in Jenkins: our most popular build systems all organize everything around build numbers. And for the most part we just don't care about build numbers. They don't answer any question the developer cares about. The developer cares about: did my commit work, did my commit pass, is it functional?
When you look at all these different jobs, you see they all have different numbers. One job can be at build 576 while another job testing the very same commit has a completely different number. There's no way to look at this and tell whether those two builds are running the same code or not. And that's terrible.
When you have this many tests, once you start writing a ton of them, you notice that Jenkins becomes slower and slower and slower. It works fine, until you come to depend on it. You want instant answers and it doesn't give you that; it takes over 10 seconds to load a page, and that's pretty bad too.
Ultimately what happens is that nobody responds to build failures. When you get an email you don't know what to do, so you do nothing, and the problems keep going. Halfway through the year we had this problem and it wasn't getting any better. Even though we were doing all this testing and all this work, we saw no end in sight for actually improving the quality of our code and being able to measure it, because the information was all over the place. This had to change.
The solution we built for this is something called yo/tests. Internally at Yahoo you literally type in yo/tests and it takes you to this page. You saw it earlier, and I get to show it to you now. Basically it fixes the pain we had with three things.
First, it classifies flaky tests. Yo/tests understands when something's flaky. The first time a test fails it shows up as a legitimate failure, and when something that's been stable fails, someone has to stop everything and investigate. But if they see that someone only updated the readme and yet some test is failing, they just mark it as flaky and move on with their lives. Then they're off the hook, and the test goes into a list of flaky tests that we have to review. That list is at about 350 right now.
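A minimal sketch of this kind of classification, assuming a simple results model (yo/tests is internal, so this is illustrative rather than its actual code): a test that's been manually flagged, or that both passed and failed on the same commit, is treated as flaky and kept out of the failure alerts.

```js
// Illustrative sketch of flaky-test classification (not yo/tests itself).
// A test is considered flaky if it has been manually flagged, or if the
// same commit has seen it both pass and fail (same code, different result).
var flaggedFlaky = {}; // testName -> true, set when a human marks it flaky

function isFlaky(testName, resultsForCommit) {
    if (flaggedFlaky[testName]) { return true; }
    var sawPass = false, sawFail = false;
    resultsForCommit.forEach(function (r) {
        if (r.test === testName) {
            if (r.passed) { sawPass = true; } else { sawFail = true; }
        }
    });
    return sawPass && sawFail;
}

// Only stable failures should alert engineers; flaky ones go to a review list.
function actionableFailures(resultsForCommit) {
    return resultsForCommit.filter(function (r) {
        return !r.passed && !isFlaky(r.test, resultsForCommit);
    });
}
```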
We also don't highlight flaky test results at all. They're there; you can click through to the detailed page and see what failed. But when you're checking whether your commit is good to go, you don't need to know that something with a habit of breaking is failing again.
We don't alert developers when something they didn't cause, some bad test that's been around forever, happens to fail on their change. This prevents people from panicking over a bad test. It reduces false positives for changes made in a completely different component, and it's a total win.
We also hide flaky infrastructure. If something goes wrong that would normally cause a red build, the bad, failing build, no one cares. They only care whether the tests are working, not whether Sauce Labs is down or Selenium flaked out or whatever. It just doesn't matter to them. So we hide that, and that prevents panic from things that really should go to me, because I'm the person responsible for the infrastructure, not the entire team or the people making changes, like you guys.
The other thing is we don't show build numbers unless you go looking for them. Instead we focus on commits, the things you actually push to GitHub.com. As builds complete they're organized around commits, which is pretty big: one commit can have dozens of builds that ran it. That turns on its head what CI systems do today. Instead of everything being organized by builds, the interface is organized by commits, each of which has its many builds attached. So developers know exactly where their code broke.
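A rough sketch of that inversion, again illustrative rather than yo/tests code: key everything by commit SHA and attach builds to it, instead of keying by build number.

```js
// Illustrative data model: organize results by commit, not by build number.
// Each build that comes in from any CI job is attached to the commit it ran.
var commits = {}; // sha -> { sha, author, builds: [] }

function recordBuild(sha, author, build) {
    // build: { job: 'yui3-dev-3.x', number: 576, environment: 'IE 6',
    //          results: [...] } -- the build number is kept, just demoted.
    var commit = commits[sha] || (commits[sha] = {
        sha: sha,
        author: author,
        builds: []
    });
    commit.builds.push(build);
}

// "Did my commit pass?" becomes one lookup instead of a hunt across jobs.
function commitPassed(sha) {
    var commit = commits[sha];
    return !!commit && commit.builds.every(function (b) {
        return b.results.every(function (r) { return r.passed; });
    });
}
```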
Now when you get that email, instead of staring at it with no idea what to do, you go to yo/tests and you see this page. I'll give you a minute to look at it; I'll keep it up for a few seconds. The idea is that you see exactly what the engineers on our team care about right away, and the key is the use of highlighting to show that you need to take action on something. If something isn't highlighted on this page, it means you're good or it's not complete yet. When it's something we know you need to act on, it's highlighted.
Here we see that the last thing pushed to the dev-3x branch had a stable unit test failure, a test that shouldn't have failed. We can also see that it's actually been failing for a while, so someone needs to go in and mark it as flaky.
The other thing we see right away is that there are environments that haven't run yet. Like I said, we roll builds up into commits, and we do the same thing with test results, organizing them by environment. You can have IE6 running on different infrastructure, like Sauce Labs or your own company's Selenium instance. What we do in yo/tests is recognize that there might be slight variances in what IE6 identifies itself as, and group it all together under the same environment.
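A minimal sketch of that grouping, with hypothetical user-agent patterns (real user-agent handling is more involved): map the slightly different identifiers reported by each infrastructure onto one canonical environment name.

```js
// Illustrative environment normalization: different infrastructures report
// the "same" browser slightly differently; map them to one canonical name.
function normalizeEnvironment(reported) {
    var ua = reported.toLowerCase();
    // Hypothetical patterns; real user-agent parsing is more involved.
    if (/msie 6/.test(ua) || /internet explorer.*6/.test(ua)) {
        return 'IE 6';
    }
    if (/msie 10/.test(ua)) {
        return 'IE 10';
    }
    return reported; // fall through: keep the raw name
}

// Results from Sauce Labs and an in-house Selenium grid now group together:
normalizeEnvironment('MSIE 6.0; Windows NT 5.1');         // 'IE 6'
normalizeEnvironment('Internet Explorer 6 (Sauce Labs)'); // 'IE 6'
```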
So no matter what infrastructure is running our tests, we can see what's still missing because of some infrastructure problem. We do still have infrastructure problems; they've plagued us right up until this talk, because I've been working on the talk instead of fixing them. But we know exactly what's going on here. This is just tremendously better, and it gives people what they actually need to know.
If you click on, say, the first entry, it shows us what the commit was, who committed it, and their GitHub avatar. It shows us a list of the four tests that failed, what component they're in, their names, and which build they failed in. So you can click on the build and go right to the source of the problem.
The other thing we have here is the list of the many different builds that all contributed to the health report, the test report, for this commit. If you want to see what happened you can click on any of them. What's interesting is that you can run many different jobs for the same commit and they're all shown here. That's something a lot of systems don't do.
You can restart builds in Travis, too. But here, when we restart builds, we can notice things: we see that a test passed, then failed, then passed again. Because each run is on the same code, that can hint to our system that the test is flaky. So that's pretty interesting.
Like I said before, you can get to all the unstable test failures. We don't highlight them, so you have to scroll down, but eventually you see them. As we go and try to fix unstable tests, you can still see that they're there and that they're getting better.
What all this has done is help us ship more often and ship better-quality releases. Things that we would normally have missed in releases are caught by this system, and we've been using it since YUI 3.12.
What I'd like to show you really quickly, in the final few minutes I have, is how this helped the most recent release. The most recent release lived on GitHub under a release branch. Inside yo/tests you can just click on dev-3x, or, when it's release time, the 3.13.0 release branch appears above it. If you click on that, you can see exactly what we used a month ago when we released the last version.
Looking at the rows here, we had a problem right before the release where code that shouldn't have been checked in was checked in. What's really cool is we can come here and see exactly what we need to know at release time: whether we're good to go or not.
When it came down to it, for the final commit, which was just a package.json change, we could see that everything but Android 4 was tested and we had zero stable unit failures. We did have stable Selleck failures, but when we went through them we saw that these weren't things that would block the release. So that was really good; we had exactly what we needed to know. Android 4 didn't run because of a temporary Sauce Labs issue, so we had to test it by hand, but then we were good to go.
And here we see that way more environments were tested for that final release, so we had a lot more things running and working.
Briefly, to recap some things in this interface that I think are key takeaways for people who are building a similar tool or are into this kind of thing.
We highlight things when you need to care about them. When something's incomplete it's gray, but it shows up with a highlighted background if it's something that needs to be fixed, like an incomplete environment that could indicate a problem with the CI system. That's something I need to pay attention to. If we don't know something yet, because not everything has run, we can show a zero but gray it out to indicate it's incomplete. When something's wrong, we highlight it and make it red.
And then when we know that there's no more browser information that's going to come in
and we know we're good, it shows up as green and there's no highlight.
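The convention boils down to a small mapping from what we know to how it's shown; here's an illustrative sketch, not yo/tests code:

```js
// Illustrative mapping of the display convention described above.
// state: { complete: all results in, failures: stable failures seen,
//          infraProblem: an environment never reported in }
function display(state) {
    if (!state.complete) {
        // Still waiting on results: gray, highlighted only if it looks
        // like the CI system itself is stuck.
        return { color: 'gray', highlight: state.infraProblem };
    }
    if (state.failures > 0) {
        return { color: 'red', highlight: true }; // take action
    }
    return { color: 'green', highlight: false };  // good to go
}
```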
The question we're asking here is: did my commit pass? That's what people care about, watching their commit as it goes through the system. With this, we think we got it, and that's awesome. You go through, you can see that this one didn't pass, and then you go make your changes knowing exactly why.
Key takeaways. You want to prevent panic from bad tests. When you start testing more, you're going to have bad tests. You'll find that something works on 10 different environments, then you add your Windows Phone or whatever and it doesn't work. What are you going to do, drop everything and fix that? Probably not. But it's really good to collect that data.
One of the biggest benefits we've gotten from this system is in deciding what to focus on. Even though we're not focusing day to day on flaky tests, we're collecting that data, so we can see that a particular test is failing more often than all the other flaky tests. Out of those 300 there are a couple that fail way more often than the others, and those are what we need to focus on.
So you're collecting test data for things that are flaky rather than just setting some kind of ignore flag; you're just not panicking everyone with those bad tests.
You also want to prevent panic from bad infrastructure. Usually your engineers don't care about the infrastructure; one engineer will. That person should care, but not everyone else. And instead of focusing on build numbers, you should be focusing on what exactly broke: the commit that's breaking.
How do you do this? Classify flaky tests: your CI system should understand what's flaky and what's not as you ramp up your testing. Hide flaky tests and infrastructure failures from engineers, and focus on commits.
We're not done with this tool. It's really pretty new, only a couple of months old. A lot of people are interested in it, and I can't wait to make it available for this community to use directly. It's already helped us decide when to ship a YUI release, and we use it to test all the contributions that have been given to us. We've been using the new CI system for the last several months, and over the last couple of months we've been using this tool on top of it to help us ship better releases.
But we're not done. We would love to explore how we can share as much of this as possible
with this community, starting with this talk. Hopefully what you've learned is some of the
pitfalls of doing testing in a really big way and how you can avoid those pitfalls and
test better.
Thanks a lot. By the way, this is all made with YUI and Shifter and Jade and other fun
stuff, and you can find that all out when you view these slides at reid.in/everywhere.
My name's Reid and it's been really fun, so thanks guys.
[applause]
Jenny: Thank you. We have time for a question if anyone has one while we get set up for
our next speaker. Let me know. Okay, Reid's going to be around the conference if you want
to follow up with him directly.
Did you have something? Alright.
Audience member: I'm probably going to follow up later with Reid, but I have a quick question for him. How does Yeti fit into the whole multi-environment, cross-machine setup? Basically, is there one piece that kicks everything off and ties the unit tests together with Selenium? How does Yeti fit into the whole environment that you just talked about?
Reid: Yeah. Everything goes through Yeti at some point. We're basically using it in a pretty standard way. We have a Yeti hub; for Sauce Labs, for example, we have a Yeti hub that sits on the internet and our build system talks to that, and Sauce Labs goes through it to talk to our CI slave. With Yeti, just out of the box, you can say: connect to this ondemand.saucelabs.com address and start these three browsers. That's what we do. We start 3 Safari browsers or 3 iOS 6 browsers and it handles everything for us.
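For a flavor of what that looks like, here's a sketch of wiring Yeti into a build step. The hub URL and test paths are hypothetical, the flags should be checked against the Yeti docs at yeti.cx, and the Sauce Labs/WebDriver browser launching is configured through Yeti's own options, omitted here:

```js
// Sketch of driving Yeti from a Node CI step (flags hedged; see yeti.cx).
// A hub is started once somewhere reachable ("yeti --server" serves one on
// port 9000 by default); each CI job then points Yeti at that hub along
// with the HTML test pages to run in whatever browsers are connected.
var exec = require('child_process').exec;

exec('yeti --hub http://hub.example.com:9000 tests/unit/*.html',
    function (err, stdout) {
        console.log(stdout);       // per-browser pass/fail summary
        process.exit(err ? 1 : 0); // propagate failures to CI
    });
```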
Really quick: one thing I did recently is we now have some code that starts a new Yeti hub every time there's a build. Each build has its own copy of Yeti, which prevents some flakiness and makes some of our builds faster. We look forward to sharing that once it's working well.
Jenny: Tell people what Yeti is.
Reid: Yeah, so if you guys don't know what Yeti is: Yeti is our tool for testing YUI. And not just YUI; it also works with Mocha, Jasmine, QUnit, and one more I'm forgetting. It's open source and you can find out more at yeti.cx. It's pretty awesome. If you're testing JavaScript and want to know how to automate things, it not only works with our huge CI system, it works really well for small teams and on your own computer, so check it out.
[applause]