>> HAWKINS: In this presentation I'm going to go over the rationale for testing, some of the types of tests that we have available to us, and the tools that we use to do testing. The Chrome team has around 200 developers and we average about 100 commits a day. With all that code flowing in and out, we need to make sure that stability remains a high priority, and one way to assure that stability is to test the code that we're running. One of the things that tests do is make sure that the feature you're writing runs the way you expect; you can't have people hand-testing your feature every day. The tests make sure you don't have any regressions that pop up. Another thing that tests do is document what your code is expected to do, so that when other people come and read your code, they understand the rationale and the logic behind what you wrote. Tests also guide the design of the code, in the sense that when you write tests first, you start thinking about, "How am I going to implement this so that these tests pass?" That can modify your design heavily.
A big piece of the infrastructure that we have in Chrome is our trybots. Trybots are a pool of machines that you can send try jobs to; they take your change list and run it through our suite of tests. They mainly run it on the three main platforms, Windows, Linux, and Mac, but it is possible to pass a bot parameter to specify another bot you want to run on. Some of the bots we have available are Valgrind, Chrome OS, and Linux Views, so when you're changing code that affects those platforms it's good to run those tests as well. The main test suite takes about an hour to run on average for the three platforms.
If you want to cut down the running time of your try job, you can pass in the -t parameter, which takes as its argument the test program name, a colon, and the filter that you want to use. So, for example, if I want to run a specific unit test, I'll pass in -t unit_tests and then the name of the test that I want to run, and it will only run that specific test, which greatly cuts down on the testing time.
There's also a concept for trybots known as the Last Known Good Revision. The Last Known Good Revision is set by the official builders on the waterfall, and it's the last green build that we had. Whenever we get a green build on all the platforms, the Last Known Good Revision, or LKGR, is updated. However, the LKGR could be several revisions behind what you want to test, so you can pass in the revision you want using the -r parameter.
Whenever you upload a CL, try jobs used to be started automatically, but that's no longer the case because we have so many CLs uploaded these days, so you need to make sure that before you commit you at least run a try job on your latest patch set. That makes sure that any failures in your CL are caught early: you'll see the failures before you actually make the commit, and that keeps our main tree a lot greener. The link at the bottom of this page is the try-server waterfall. If you don't find your try job on the CL page itself, you can head over to that link, find your LDAP or your username, and see the status of your try job.
There are three main types of tests, and I'm going to start with unit tests. These are our lowest-level tests. They test units of the code, such as a method on an object like Watchdog.ArmAtStartTime. The unit tests for that method would test all of the logic paths in that method. The good thing about unit tests is that they document exactly what's going to happen in that method for all the input parameters you could specify, and you verify that the output is correct. Unit tests are small, so they usually run pretty quickly, and they also don't run with other tests; they're isolated, so you test exactly what you want and you see the results for just that bit of code.
As an example, I'm going to use the AutoFill feature, which has an InfoBar for credit cards. For writing a unit test for this, I would be testing the AutoFillCCInfoBarDelegate class itself, which is the lowest-level implementation of this InfoBar. There are three buttons in this class, a link, and an icon, and the methods on AutoFillCCInfoBarDelegate specify how this InfoBar is supposed to look. So I could run those methods and make sure that I get the three buttons back and the text of those buttons, and make sure the link has the appropriate URL.
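As a rough sketch of what that unit test could look like, using a made-up, simplified stand-in for the delegate rather than the real AutoFillCCInfoBarDelegate interface:

    #include <string>
    #include <gtest/gtest.h>

    // Made-up, simplified stand-in for the InfoBar delegate described above;
    // the real AutoFillCCInfoBarDelegate has a richer interface and constructor.
    class FakeCCInfoBarDelegate {
     public:
      int button_count() const { return 3; }
      std::string save_button_label() const { return "Save"; }
      std::string link_url() const { return "http://example.com/autofill-help"; }
    };

    TEST(AutoFillCCInfoBarDelegateTest, DescribesButtonsAndLink) {
      FakeCCInfoBarDelegate delegate;

      // The delegate should report the three buttons and their labels...
      ASSERT_EQ(3, delegate.button_count());
      EXPECT_EQ("Save", delegate.save_button_label());

      // ...and the link should point at the appropriate URL.
      EXPECT_EQ("http://example.com/autofill-help", delegate.link_url());
    }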
There is a unit_tests target itself, but there are also unit test targets for our other major modules in Chrome, some of which are listed here. There are actually quite a few unit tests, which is important; we need a lot of tests. The next step up in our testing types is browser tests.
Browser tests create a browser object in the test so that you can test how your code interacts with the browser object itself. A browser test is really a specialized unit test, but there's infrastructure set up so that it loads the browser automatically and you have access to browser internals. They run in a different process, and the sandbox is disabled. There is a running message loop, so you have to take that into consideration when you run things on different threads; the default thread is the UI thread. Another thing to take into consideration when you're running the tests yourself is that the browser window is not visible by default, but you can specify that it be visible.
Going back to the AutoFill InfoBar, an example of a browser test for it would be: load up an InfoBar; make sure that the InfoBar is in the tab you specified; make sure that there's only one InfoBar per tab; make sure that you can load up multiple InfoBars, one per tab, so there's more than one overall. You can call the method on the InfoBar class to press the close button and then make sure that, in the browser object, the InfoBar is no longer there.
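A rough sketch of that flow; IN_PROC_BROWSER_TEST_F and InProcessBrowserTest are the real browser-test plumbing, but the InfoBar helpers below are made-up names standing in for the actual Chrome calls:

    // Hypothetical helpers for this sketch only; a real test would use the
    // Browser and tab APIs directly to add and query InfoBars.
    void ShowAutoFillCCInfoBarInTab(Browser* browser, int tab_index);
    int InfoBarCountForTab(Browser* browser, int tab_index);
    void ClickInfoBarCloseButton(Browser* browser, int tab_index);

    class AutoFillInfoBarBrowserTest : public InProcessBrowserTest {};

    IN_PROC_BROWSER_TEST_F(AutoFillInfoBarBrowserTest, OneInfoBarPerTab) {
      // Show the credit-card InfoBar and check it landed in the tab we expect.
      ShowAutoFillCCInfoBarInTab(browser(), 0);
      EXPECT_EQ(1, InfoBarCountForTab(browser(), 0));

      // Showing it again in the same tab should not add a second InfoBar.
      ShowAutoFillCCInfoBarInTab(browser(), 0);
      EXPECT_EQ(1, InfoBarCountForTab(browser(), 0));

      // Pressing the close button removes it from the browser object.
      ClickInfoBarCloseButton(browser(), 0);
      EXPECT_EQ(0, InfoBarCountForTab(browser(), 0));
    }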
UI tests are high-level, end-to-end integration tests. They're an example of black-box testing, in that you don't have access to object internals; you have to implement the test through an automation proxy. There's an Automation API which allows you to control the browser, and there are several different ways we use it. The main one is UI tests, which are tests written against the Automation API. There are also automated UI tests, which use fuzzing: there's a list of commands that are implemented by the Automation API, for example, click a button in the interface, load up a page, load up a new tab, and the fuzzing creates lists of different combinations of these commands, runs them, and mutates the lists. So you're basically taking all these different actions that you can take with the browser, running them, and seeing how they interact.
There's another test type called interactive UI tests, which are necessary for the Windows bots in that these tests interact with the UI by clicking on or moving the browser around, and the bots themselves have to be specially configured to take advantage of this. The last one is PyAuto, which is a Python interface for writing automation tests; it's a really easy way to write UI tests in Python.
So we'd have an API call that loads a web page with the form that we've written. The next API call could fill out that form with data that we specify. You'd have an API call that clicks a button using the actual OS infrastructure for clicking, programming the click to happen on a button, and at that point you check that the InfoBar shows up as you expect. The next call could use the clicking API to simulate a click on the "Save" button or the "Don't Save" button. For the Save button, you want to make sure that the data is saved, so there'd be another API call to ask, "Is there profile data? Does it match what we expected?" And if you click on the "Don't Save" button, you also want to check that the credit card data was not saved.
So whenever you're running a test and it's not working out exactly as you planned, maybe you got a crash or it's just not functioning correctly, you should be able to debug the test. On Mac and Windows, it's pretty easy to debug tests using the visual interface, whether it's Xcode or Visual Studio: you just compile the test target and then debug that test target. On Linux, it's possible to do that, but in most cases you're going to use GDB. You pass GDB the test binary using --args, along with the GTest filter of the test you want to run (for example, something like gdb --args unit_tests --gtest_filter=MyTest.MyCase), and then you should be able to step through the test and figure out why it isn't working the way you think it should.
We have a few frameworks for writing tests that make test writing a lot easier. GTest is the biggest piece of how we write tests, and it does a lot of the work for you. You can check out the link at the bottom; there's a lot of good documentation there. I'll give an example. The high-level way to look at tests in GTest is that you have a test program, which would be the unit tests binary, the browser tests binary, or the UI tests binary; that's your test program. Your test case would be, say, an AutoFillInfoBar test, which has many tests inside of it. Each test is testing a specific piece of that functionality, whether it's a method on the object, interaction with the browser, or a high-level end-to-end UI test.
As an example of the GTest framework, it's really easy to set up a test. You have the TEST macro, the test case name, and then the test name itself. In this one, we're using ASSERT_EQ, which is a macro used to check that something is how you expect. It's an assertion, so if the thing that you're testing is not valid, the test itself will stop. We also have EXPECT_EQ, which will not cause the test to stop if the thing that you're checking is not true.
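A minimal sketch of that structure (the test case name and values here are made up):

    #include <vector>
    #include <gtest/gtest.h>

    TEST(VectorTest, HoldsWhatWePushed) {
      std::vector<int> values;
      values.push_back(4);
      values.push_back(8);

      // ASSERT_EQ is fatal: if the size is wrong, stop here rather than
      // index past the end of the vector below.
      ASSERT_EQ(2u, values.size());

      // EXPECT_EQ is non-fatal: a failure is recorded, but the test keeps going.
      EXPECT_EQ(4, values[0]);
      EXPECT_EQ(8, values[1]);
    }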
Something that you can use to reduce duplication of code is a test fixture. A fixture is data setup that runs for every test in a test case. I'll show you an example of that. QueueTest is the name of the test fixture; we inherit from testing::Test. There are two methods that you can override: SetUp, which you usually want to override and which is run at the beginning of each test, and TearDown. In this SetUp, we are enqueueing some values into several queues. For this test case we don't need anything to be torn down, but if, for example, you weren't using scoped pointers and you had allocated data, TearDown is where you'd want to destroy that data.
Using the same QueueTest fixture, we have two tests here: IsEmptyInitially, where we just expect that the queues are empty, and they are; and DequeueWorks, where we rely on the data loaded in the fixture, the values enqueued into each of the queues in SetUp, and then expect that when we dequeue we get those values back.
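Here's a sketch of that fixture and those two tests, along the lines of the example in the GTest documentation, with std::queue standing in for the Queue class on the slide:

    #include <queue>
    #include <gtest/gtest.h>

    class QueueTest : public testing::Test {
     protected:
      // SetUp runs before each test in this test case.
      virtual void SetUp() {
        q1_.push(1);
        q2_.push(2);
        q2_.push(3);
      }
      // No TearDown needed here; nothing was heap-allocated in SetUp.

      std::queue<int> q0_;
      std::queue<int> q1_;
      std::queue<int> q2_;
    };

    TEST_F(QueueTest, IsEmptyInitially) {
      EXPECT_TRUE(q0_.empty());  // q0_ never had anything enqueued.
    }

    TEST_F(QueueTest, DequeueWorks) {
      ASSERT_FALSE(q1_.empty());  // Fatal: don't touch front() on an empty queue.
      EXPECT_EQ(1, q1_.front());  // The value enqueued in SetUp.
      q1_.pop();
      EXPECT_TRUE(q1_.empty());
    }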
There are a lot of assertions that you can use; this table gives you quite a few of them. The fatal versions are the ASSERTs, as in they will stop the test and the test won't run any further. Usually you want to use these on things that would otherwise cause the test to crash in a way you don't expect; for example, asserting that a vector is a certain size, because you're then going to access all the elements of the vector, and you don't want to access those without asserting first. The EXPECT assertions won't cause the test to stop; they'll just cause a test failure if they don't evaluate to true.
Another piece of the framework that we use is Google Mock. Mock objects allow you to provide an object that you don't want to implement fully; you just want to specify the interaction with that object. For example, here we have a PersonalDataLoadedObserverMock. This is an observer in the messaging sense, and what we have is a mocked-out method, personal data loaded, which will be called by some other class. The action that we expect to take is QuitUIMessageLoop; we can put assertions inside the action along with whatever else we need to do, and here we're going to quit the current message loop in that action. In the test, we create a profile and we set the profile info. Inside SetProfileInfo, the personal data observer is going to be called, and we expect that call to happen. We didn't have to create a whole personal data observer, though; it's just a mock object with a mock method. We don't care about anything else on the object, just that this one method gets called. So we have EXPECT_CALL on the observer object for the call we care about, and then we say WillOnce(QuitUIMessageLoop). WillOnce means we expect it to happen exactly once, and the action to be taken is QuitUIMessageLoop. There's also WillRepeatedly, which means that after the call to SetProfileInfo, the action you're expecting could happen several times; that's a good way to specify that. At the end of this piece of code, we need to run the current message loop, because we've queued up this call in the message loop and we need that message to be delivered. So we run the message loop.
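Roughly, the pattern looks like this; the observer interface and the QuitUIMessageLoop action below are simplified stand-ins, not the exact Chrome classes:

    #include <gmock/gmock.h>
    #include <gtest/gtest.h>

    class PersonalDataObserver {
     public:
      virtual ~PersonalDataObserver() {}
      virtual void OnPersonalDataLoaded() = 0;
    };

    class PersonalDataLoadedObserverMock : public PersonalDataObserver {
     public:
      MOCK_METHOD0(OnPersonalDataLoaded, void());
    };

    // Stand-in for the action in the real test, which quits the UI message loop.
    void QuitUIMessageLoop() {}

    TEST(PersonalDataTest, ObserverIsNotifiedExactlyOnce) {
      PersonalDataLoadedObserverMock observer;

      // Expect exactly one call, and run our action when it happens.
      EXPECT_CALL(observer, OnPersonalDataLoaded())
          .WillOnce(testing::Invoke(QuitUIMessageLoop));

      // In the real test, SetProfileInfo() triggers this notification and the
      // message loop is then run; here we just invoke it directly.
      observer.OnPersonalDataLoaded();
    }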
We have several tools for testing. One of the most useful ones is Valgrind, which is a memory error checker. Memcheck is the tool that checks for things like leaks, uninitialized memory, and using the wrong delete; if you use delete when it should be delete[], it will notify you of that. You can see the command to run it right there. There's also ThreadSanitizer, which detects data races, and the command for that is down there as well. Mostly, though, you don't have to run these yourself, in that we have Valgrind bots on the main waterfall which continuously run these tests; they take about an hour and a half to run, and they will catch these errors for us. Once they do, we file a bug and notify the owner of the code that they have a memory error, and once you're notified, you can run the test yourself under the tool to debug it. An important part of testing is code coverage, something we need to keep in mind.
Code coverage is how many lines of code are actually executed by your tests. Obviously, the higher your code coverage, the more likely it is you're going to catch errors. You can theoretically get to 100% code coverage, but even if you do, on average you only find about half of the bugs, because the number of possible inputs and outputs is just too large to test. There are three pieces of information with code coverage: the number of lines instrumented, which are the lines actually compiled into your test, including testing code and source code, and the number of lines covered, which are the lines that were hit when you ran your test. The piece of information that's missing is code that was part of the source but is not actually compiled in, and this is something you want to take into account for the different bots. For example, on Linux, if there's a Windows-only test that we don't mark as Windows-only and it's not being compiled on Linux, the instrumentation will think those are missing lines, because they are, and that will lower your coverage. We want to have the most accurate coverage we can possibly have, so there's a way to go in and say, "This is a Windows-only test. We don't need to run it on Linux or Mac, so don't count it against us."
Incremental coverage is a way of measuring, for each commit to the tree, how many lines of code are tested in that commit, for the lines it adds: of the plus lines in a diff, how many are covered. Ideally, we want 50% incremental coverage, so in any commit, of all the new lines, you want 50% of them to be tested. For example, if your change adds 40 lines and your tests execute 20 of them, that's 50% incremental coverage. At that point you're at break-even: you're not losing coverage, but you're not gaining any more coverage either. At some point we'd want to bump that up to 75%, but 50% is a tough enough target as it is. We're currently working on adding the ability to track incremental coverage and setting that 50% target, so that there will be a bot on the main waterfall that alerts on any build containing a commit that doesn't reach 50% incremental coverage; the bot will go red and the owner of that commit will be notified that they didn't hit 50% incremental coverage. We're getting pretty close to that. We do have a trybot for coverage out there; there's a Linux bot available right now, and once we get more machine infrastructure, we're going to get a Mac and a Windows bot as well.
The coverage try jobs take about an hour and a half, but they're a good way to find out how much your tests exercise your code, how good your coverage is for the test that you just wrote. There are also three bots up right now that run coverage continuously, even though they don't track incremental coverage. They're on the experimental waterfall, which is the link at the bottom of the page. If you go to that page and search for coverage, you'll see the Mac, Linux, and Windows coverage bots. Those will give you our current coverage analysis, and you can look at any code that you currently work on to see how good your coverage is.
One thing that crops up quite frequently in a project of this size is failing tests. Failing can mean a lot of things: it could be that an expectation or an assertion doesn't pass in the test, or it could be a crashing test or a hanging test. All of these are bad, but some are worse than others. Disabling a test is not a good thing, because a disabled test doesn't run at all, so it's effectively dead code at that point. There are two types of tests that need to be disabled, and those are crashing or hanging tests. We disable those because the rest of the testing infrastructure does not continue to run; for example, with unit_tests, if you crash on the very first test you run, you're not going to run any of the rest of your tests, which is the worst case. So for those, it's okay to disable. To be even more specific, you should use platform-specific defines that say, "only disable on Windows," because it's only crashing on Windows. You want to be as specific as possible when disabling, so we can still have continued coverage on Linux and Mac.
Whenever you do disable a test, or whenever you change the moniker of a test to DISABLED, FAILS, or FLAKY, you want to make sure that you add a comment to the code, file a bug, and then add that bug to the comment, so that whoever comes to look at the test later has that bug for reference; the bug should probably include a log of the failing test. The FLAKY moniker is what we should use for flaky tests. Flaky tests fail spuriously: they might run green a couple of times, then fail, and then go green a couple more times; they just fail on and off. There's also FAILS, which is used for tests that fail continuously. The infrastructure behind FLAKY and FAILS is actually exactly the same. The benefit to developers is that you know right when you look at the code, "Okay, this test is always failing; there's some error that's continuously happening," or, "There's some sort of flake in the system; it's unknown, but we're going to have to look for a flaky issue here."
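In code, the monikers are just prefixes on the test name; the test and bug number below are made up for illustration:

    #include <gtest/gtest.h>

    // This test fails intermittently on the bots.
    // Tracking bug (hypothetical number) with logs: http://crbug.com/12345
    TEST(AutoFillInfoBarTest, FLAKY_ShowsInfoBarAfterFormSubmit) {
      // ... test body unchanged; only the name gets the FLAKY_ prefix ...
    }

    // A test that fails deterministically would get FAILS_ instead, and a
    // crashing or hanging test would get DISABLED_ so it does not run at all.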
There's a way to disable or to change the monikers for several tests at one time. Say you have a test, or several test cases, that are only failing or crashing on Mac OS. You can use a SKIP_MACOS macro, which takes a test name and turns it into the DISABLED_ version of that name, using the two pound marks (token pasting) to splice the test name in. Then, for every test that's failing on that platform, you say SKIP_MACOS and then the name of the test.
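A sketch of the kind of macro being described, assuming OS_MACOSX is the define for Mac builds; the exact SKIP_MACOS definition in the tree may differ:

    #include <gtest/gtest.h>

    #if defined(OS_MACOSX)
    // The two pound marks (token pasting) splice the test name into a
    // DISABLED_ name, but only on Mac builds.
    #define SKIP_MACOS(test_name) DISABLED_##test_name
    #else
    #define SKIP_MACOS(test_name) test_name
    #endif

    // Runs normally on Windows and Linux; disabled on Mac only.
    TEST(AutoFillInfoBarTest, SKIP_MACOS(ShowsOneInfoBarPerTab)) {
      // ... test body ...
    }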
One thing you don't want to do: if there's a Valgrind or ThreadSanitizer error, make sure that you don't disable the test at the code level if at all possible. For example, if you have a leak and the Valgrind bot is red because of your leak, you shouldn't disable your test. You should either add a suppression, which is most likely what you want for a leak, or, if the test only crashes when run under Valgrind, you should go into the Valgrind files, which keep a list of tests that should not be run under Valgrind and ThreadSanitizer, and add your test to that list.
So in summation, we need to write a lot of tests. We need to then write some more tests.
We have a lot of tools that you can use; make the most of them, especially Valgrind and ThreadSanitizer. If you have a test that is failing under one of those tools, use that tool locally to debug it. We need to take care of coverage, to make sure that our features are as covered as possible for the things they need to be covered for. And whenever a test is failing, really think about whether you should be disabling it or marking it as FAILS or FLAKY, so that we keep our coverage numbers up. So that's about it. Do we have any questions?
Okay. Thank you very much.