PHIL: I'll introduce Adam Oliner, who got his PhD last
year at Stanford, and he's going to talk about some
research that he's been doing as a post-doc at UC-Berkeley.
ADAM OLINER: All right, thanks, Phil.
So this is joint work with Anand and Ion at the AMP Lab
at Berkeley, and a couple of people at the University of
Helsinki, Eemil and Sasu.
So mobile is a hot topic as you know.
Sometimes this is literally true.
This is an article that someone--
well, it's a letter that someone sent into Lifehacker
complaining that a phone, which they believed to be
sitting idle in their pocket, was actually getting hot.
So it was clearly doing something even when it wasn't
supposed to be.
And this user is asking questions like, how do I make
this stop and why is it doing this?
And so, anyone who has a smartphone has probably had a
similar experience.
So they start their day and the phone is fully charged.
And maybe they don't do any work for a while.
It sits idle for a bit.
Maybe they actually get something done.
It sits idle.
And at some point, they take it out of their pocket and it
complains that the battery's been depleted.
This, I think, is a pretty much universal experience for
people who have these kinds of devices.
And for the user, there are I think three primary questions
that they'd like to answer.
First is why the battery is draining.
Is it something that they did?
Is it an app that they're running?
And then second question, one which Carat is uniquely
positioned to answer as you'll see over the course of the
talk, is whether or not that drain is normal.
So if the user does the same thing tomorrow, should they
expect to see similar battery drain?
If another person who has a similar phone does the same
behavior, should they expect to see similar battery
drain, and so on.
And then finally, what can the user do about it?
Is there a behavior that they can change or a setting that
they can change to improve their battery life?
So today I'll be talking about a tool that we built called
Carat, and a method for analyzing data that enables us
to answer those three questions that we were talking
about in the previous slide.
I'll describe how we go from the state samples that we take on individual devices to aggregating them and using them to do energy diagnosis
for those same users.
I'll talk about some of the ways we deal with uncertainty
in practice.
So as you measure things on devices, there are lots of
sources of uncertainty and imprecision.
And I'll describe some of the ways that we deal with those.
I'll talk about our implementation, mostly on the
analysis side.
And then finally, describe our deployment
and some of our results.
OK, so just to give you a little bit of background,
prior approaches had been pretty much strictly ad hoc.
So there was some work out of Purdue that looked for specific kinds of misbehavior.
So, in particular, on Android they were looking at something
called a no-sleep bug.
This was a violation of a locking discipline that would
prevent the screen from going to sleep.
And so if your energy problem was not related to a no-sleep
bug, that approach would not help you to figure out what
was going on.
Many of the previous
approaches were also intrusive.
They would require you to instrument the operating
system, or have access to the source code of a particular
app that you were trying to debug.
And this is frequently prohibitive, especially for
lay people.
And then finally, you can find dozens and dozens of apps on
pretty much every app store that claim to help you with
your battery life.
And almost universally, these are generic pieces of advice.
They'll tell you, kill all of your background apps or dim
the screen.
Basically, the advice they give is just don't use your phone as much, which is
also not very helpful.
It's not actually telling you what was going on and whether
or not that was surprising behavior.
So our approach looks like this.
This is a kind of overview of the infrastructure.
We collect data.
Rather than from a single device, we collect it from the
crowd, aggregate that in the cloud, and do a statistical
analysis on it there.
And then we return to the user, for their particular device and their behaviors, information about what seems to be
causing the energy problems, and whether or not that's normal.
So as far as we're aware, this is the first collaborative
approach for diagnosing these sorts of energy problems.
So we built Carat as a mobile app for both iOS and Android.
And it provides personalized energy debugging.
So what it will tell you is what's misbehaving, whether that's normal, and what you can do about it.
And those are things we've mentioned before.
And furthermore, Carat will also try to quantify how much
it will help.
And I'll describe how we do that.
So the design goal of this project was to take a
particular design point in the space.
And so we chose the point that was the most invasive
method that works on both iOS and Android.
So there are things that you can do that are specific to
either platform, but we wanted to do, what is the sort of
most you can do that works on both of these?
And so the primary constraint there was, what was eligible
for Apple's App Store?
That was sort of the upper bound on what we
were able to do.
And so the goal here is to just investigate how far we
can take diagnosis given only that sort of information.
So there are lots of questions related to, well, what if you
measured this?
What if you measured that?
And we're looking at those, too, but that's not the goal
of this particular stage of the project.
And I'll be talking today just about Carat as a method that
works on both of these platforms.
So this is what some of the screens of Carat look like.
On the left and on the center, you see what's called the
Action List for iOS and Android.
And what this tells you is, based on the apps that you're
running, which ones should you kill or attempt to restart, so
that you can improve your battery life.
So if it says, say kill Pandora, what it's suggesting
is that Pandora is running on your device.
And that it seems to be consuming a lot of energy.
And in particular, it's estimating that you'd get
about an hour and a half more battery life if you killed it.
Which is significant.
On the right side, you see one of the other screens, which is
the Device Screen.
And this gives you a little bit of information about how
long you should expect your battery life to last as you
currently use it.
And finally, something called a J-Score.
The J-Score tells you your sort of percentile battery
life relative to other similar users.
So this particular screen is from a device that was in the
64th percentile.
Meaning 64% of the other devices got worse battery life
than this particular device.
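To make the idea concrete, here is a minimal sketch in Scala of a percentile-style score; the names and the exact statistic are illustrative, not the actual Carat computation.

    // Illustrative only: a J-Score-like percentile rank of expected battery life.
    // expectedLifeHours: this device's expected battery life; peers: comparable devices.
    def jScore(expectedLifeHours: Double, peers: Seq[Double]): Int = {
      val worse = peers.count(_ < expectedLifeHours)       // peers with shorter battery life
      math.round(100.0 * worse / peers.size).toInt         // e.g. 64 means better than 64% of peers
    }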
So sometimes people would complain that they were
experiencing battery problems, but had a really good J-Score.
And this essentially meant that they thought they had bad
battery life, but relative to other people, they were
actually doing pretty well.
And vice-versa.
There are people who have extremely low J-Scores and
seem really happy with their battery life, which I guess is
the merit of low expectations.
There were a lot of privacy concerns when we were first
developing the app.
So people expressed a little bit of wariness about, well,
what sort of information are you going to be reporting?
There have been previous incidents about things like
Carrier IQ and so on that reported rather large amounts
of personal information.
And so we went to great pains to make sure that we were not
collecting anything that was personally identifying, or
otherwise endangering people's privacy.
In the end, it turned out that people say that they care
about privacy.
But in practice, it doesn't seem that they do.
And actually, we would've, in retrospect, liked to have
collected more personally identifying information so
that we could help individual users with support.
So we would get emails from people saying, hey, I want to
understand this particular set of recommendations, but
because we collected nothing that ties that person to the
data that we've collected from them, we have no way of
looking up their data and telling them anything.
AUDIENCE: So what's like an easy thing to collect that?
Give me the benefits.
ADAM OLINER: You mean, what is an example of a personal
identifying piece of information?
I mean, we could literally just say, hey, could you type
in your email address.
And then we can tie that to the records and then be done.
And we'll probably do that in the later version, just
because it seems like almost everyone would be willing to
provide that information.
And it would be helpful to us.
At any rate, so this was something that we were
initially concerned about, but it turns out that we probably
should have been more liberal about what we collect.
OK, so now, that's a kind of overview of
what Carat looks like.
And I'll start to dive in here and describe the process of
how we go from the data that we collect on the device to
the diagnosis.
So the way Carat works is that periodically, it will wake up
and get some time to write down information about the
state of the device.
So in particular, it'll write down the time stamp.
It'll write down the battery level.
And then a feature vector that includes things like what apps
are running, does the device have Wi-Fi access, what
version of the operating system is it
running, and so on.
And it doesn't just collect one sample, it collects
samples over time.
So in this case, you see that the battery in the second
column there is depleting over time, certain apps are being
closed or opened, Wi-Fi access is intermittent, and so on.
So this tells us what the battery level is
instantaneously, but we're interested in
rates of battery drain.
And so what we'll do is we'll look at consecutive pairs of
discharging samples and convert this into a rate.
And so we do this by looking at the difference in time, the
difference in the battery level.
This gives us a discharge rate in percent per second.
Then we take the two feature vectors and combine these two
to get a feature vector for this discharge rate.
And so we can condition the discharge rate on some set of
features like, what does the discharge rate look like when
the user has Wi-Fi access, or is running an iPhone 4, or
something like that.
So from these, we can build probability distributions,
these conditional distributions of energy drain.
So just to be clear, this is the probability distribution.
And to be concrete about it, let's say that this feature
vector is devices that are running Facebook.
So this is the distribution of energy drain when users are
running the Facebook app.
So one way that we can characterize this distribution
is by comparing it with other distributions.
So in particular, we might be interested in the distribution
when the users are not running Facebook.
So if you see a pattern like this, where the average energy drain when people are not running Facebook is
significantly lower than when they are running Facebook, we describe this as an energy hog.
This is a particular type of energy anomaly
that we look for.
And I'll explain what I mean by significantly more energy a
few slides from now.
OK, so this is a pretty straightforward
characterization.
This is just one pair of distributions that we can compare.
Another thing we might be interested in is Facebook
running on a particular device.
So let's say we're looking at Facebook
running on Ion's phone.
Now, if we only had one device, there's no way for us
to characterize whether or not this distribution
that we see is normal.
So this is a critical difference between Carat and
other approaches.
So if we have the crowd, however, we can say something
about, well, what about Facebook running on other
people's devices?
Is it the case that running on Ion's device, it seems to be
using way more energy than on other devices?
So if that's the case, then we describe this
as an energy bug.
So this is just a particular terminology that we use to
describe comparisons between an app running on one device
versus an app running on other devices.
So if it's using significantly more energy, then we call that
an energy bug.
So now, we'd like to do more than just compare these two
distributions.
We'd also like to say something about how confident
we are in this comparison.
So to do this, rather than comparing just the original distributions and looking at their expected values, which
is what we were doing before, just looking at the distance between these two, we're instead going to look at the distribution of these
expected values.
Because these are computed from a sampling of some true
distribution.
And it turns out that these are normally distributed,
which is nice.
And so instead of comparing just the expected values, we
can now talk about the relationship between these two
distributions.
And, in particular, we can quantify our error and
confidence.
So we have both an expected value, and given some
confidence interval we can say something about that expected
value being plus or minus some value.
So we can say, the expected value is x plus or minus some
error e with 50% confidence.
And if we'd like to be more confident, we can increase the
size of those error bars to some larger E and say, with
95% confidence, the true expected value is somewhere in
this range.
And there are two factors mathematically that contribute
to this confidence interval.
The first is the variance of the original distribution.
So the more variance you see in energy drain of a
particular app, let's say you watch Facebook running on a
particular device and sometimes it
uses a lot of energy.
Sometimes it uses very little, and so on.
You'll end up with a very wide spread of this distribution.
It will have high variance, and that will increase the
size of these error bars.
And similarly, you have the size of the crowd.
So basically, how much data you have.
So if you decrease the amount of variance and increase the
size of the crowd, if you do either of these two things,
you expect your error bars to decrease.
Basically, increasing your confidence.
Because you either have more data, or the data that you do
have is more consistent about what it says
about the energy drain.
And the opposite is also true.
If you have higher variance or less data, then your error bars get bigger.
So when we say that something is significant, we're talking about whether or not one distribution is far enough
away from the other distribution in terms of the expected value and the error bars.
That is, whether there actually is this sort of separation between these two error ranges.
So if they're actually separated like this, then we
can say with, let's say, 95% confidence, that one is, in
fact, on average, experiencing greater energy
drain than the other.
We call that significant.
And for the rest of this talk, it's implicitly 95%
confidence.
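To make that concrete, here is a minimal sketch of the kind of computation involved, using a standard normal-approximation confidence interval on the mean; the details are illustrative and not necessarily the exact statistics Carat uses.

    // Illustrative: a 95% confidence interval on the mean drain rate, and the
    // separation test the talk calls "significant".
    case class Interval(mean: Double, err: Double) {   // mean +/- err
      def lo = mean - err
      def hi = mean + err
    }

    def meanInterval(rates: Seq[Double], z: Double = 1.96): Interval = {
      val n    = rates.size                            // assumes n >= 2
      val mean = rates.sum / n
      val varn = rates.map(r => (r - mean) * (r - mean)).sum / (n - 1)
      Interval(mean, z * math.sqrt(varn / n))          // shrinks with more data, grows with variance
    }

    // "Significant": the two error ranges do not overlap.
    def significantlyGreater(a: Interval, b: Interval): Boolean = a.lo > b.hi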
So one of the things we'd like to do is not just quantify our error and confidence, but also increase the confidence that
we have in the recommendations that we're giving.
And one of the ways that we'll do this is by doing
classification.
So we can take a distribution and we can split it up into
two other distributions conditioned on some feature.
So we can say, this is a distribution.
Let's say the original
distribution was running Facebook.
We can split it up and say, let's look at the energy drain
when they had Wi-Fi versus when they did not have Wi-Fi.
So by doing one of these splits, we're decreasing the
amount of data that we have in each of these leaf
distributions.
And as I just said, this decreases our confidence in
the recommendation that we're giving.
But it may also be the case that performing one of these
splits decreases the variance of these distributions.
And it may do so significantly enough that actually, our
confidence, our error bars, are smaller on these split
distributions than on the original one.
So if we see this sort of behavior, then we'll perform
one of the splits.
And you can build a diagnosis tree like this.
So you can split on various features, sort of digging down
until you actually find the conditions under which energy
drain is significantly higher or significantly
lower and so on.
And these diagnosis trees allow us to compute diagnoses of the following form.
So you could say something like, killing some app A will give you x plus or minus some error e minutes of battery
life with 95% confidence, and we can also supply alternative diagnoses.
So just saying, well, you could get a similar effect if
you upgraded the OS to some other version v.
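One plausible reading of that split rule, sketched in Scala and reusing the hypothetical Rate and meanInterval definitions above (again illustrative, not the real algorithm): keep a split only if both conditioned distributions end up with tighter error bars than the parent.

    // Illustrative: split a distribution on a feature only if it tightens the error bars.
    def trySplit(rates: Seq[Rate], feature: String): Option[(Interval, Interval)] = {
      val (withF, withoutF) = rates.partition(_.features contains feature)
      if (withF.isEmpty || withoutF.isEmpty) None
      else {
        val parent = meanInterval(rates.map(_.percentPerSec))
        val a      = meanInterval(withF.map(_.percentPerSec))
        val b      = meanInterval(withoutF.map(_.percentPerSec))
        // keep the split only when the drop in variance outweighs having less data per leaf
        if (a.err < parent.err && b.err < parent.err) Some((a, b)) else None
      }
    }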
So I just want to pause right here.
Just, are there questions about this part of the talk?
Yeah.
AUDIENCE: What's your standard delta t?
ADAM OLINER: You mean the sampling interval?
AUDIENCE: [INAUDIBLE].
ADAM OLINER: Yeah, so this depends on the operating system.
So on iOS, for example, you can't schedule regular
intervals to wake up.
So all we can do is subscribe to various notifications.
And then we're at the whim of when those
notifications get triggered.
These are things like when the battery level hits 5-percent marks, when someone plugs or unplugs the
device, and so on.
So there's more flexibility on Android in terms of when you
can schedule these sorts of things.
So it depends, basically, is the answer.
On Android, it's still pretty infrequent.
So once every 5 minutes or so.
And I'll talk about what sort of overhead this incurs a
little bit later.
AUDIENCE: I can start Facebook [INAUDIBLE] and then it
wouldn't get sampled?
ADAM OLINER: That's right.
That's right.
So the granularity on any individual device is very low.
And so one of the reasons that Carat is able to work at all
is because we do this on a large number of devices.
OK, so this segues nicely into the question of dealing with
uncertainty.
So there are lots of sources of uncertainty, and I'll talk
about just a couple of those in this talk.
So for one, I'll talk about the case of iOS and measuring
the battery level.
So the battery API on iOS is kind of interesting.
So there's a particular type of notification called battery
level changed.
And when that triggers, the battery level that it tells you is roughly accurate.
So if it says 85%, then the true battery level is roughly
around 85%.
But for all other notifications, under all other circumstances where you ask the battery API what the battery level is,
the value it returns can be as much as 5 points above the true value.
So it may say 85% and it could be anywhere
between 80% and 85%.
So when we were first starting to develop Carat and
discovered that this was true, it was a little bit
disheartening.
We thought it might be a show-stopper.
But we came up with this neat statistical trick that we used
to deal with this kind of uncertainty.
What we're going to do is we'll take the measurements
that we believe to be accurate, these battery level
changed events.
And we'll use this to build a prior distribution.
And this prior distribution is what we'll use to fill in the sort of uncertainty gaps in the other information, as I'll
now describe.
So you observe the battery level over time.
And let's say this blue line is the true battery level.
And Carat, every now and then, is able to wake up and take
these samples.
And so the green dots here are going to be samples taken when
the phone was discharging.
And the red ones when the phone was charging.
So we'll throw out the red ones.
And now, let's look at some particular consecutive pair of
these discharging samples.
And as I said, this is how we compute these rate
distributions.
So if green is the value that the battery API returned, we
know that it could be as much--
as low as the extent of this black bar that's
hanging down from it.
And so the actual rate of battery drain between these
two samples could be as low as Y, meaning the low point of
the first value and the high point of the second, or as
high as x, the high point of the first one and the low
point of the second one.
And so these are our bounds of the low and high potential
battery drain.
And we use this to take a slice of the prior.
And then once we have this slice, we build a new probability distribution.
And rather than recording a single value as the rate of discharge between these two samples, we instead have a
distribution of discharge rates.
And this is based on the accurate measurements that we
took from these other events, these battery
level changed events.
So this is one of the ways that we deal with uncertainty: by using the statistics that we have from other
measurements, both from that phone and from other phones, to essentially fill in these gaps.
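A minimal sketch of that trick, assuming a discretized prior over drain rates built from the trusted battery-level-changed events; the bucket representation and names are invented for illustration.

    // Illustrative: instead of one point estimate, keep the slice of the prior that is
    // consistent with the battery API's up-to-5-point uncertainty, then renormalize.
    def rateSlice(prior: Map[Double, Double],           // drain-rate bucket -> probability
                  b1: Double, b2: Double,               // reported levels, may read up to 5 points high
                  dt: Double): Map[Double, Double] = {
      val low   = math.max(0.0, ((b1 - 5.0) - b2) / dt) // slowest drain consistent with the readings
      val high  = (b1 - (b2 - 5.0)) / dt                // fastest drain consistent with the readings
      val slice = prior.filter { case (rate, _) => rate >= low && rate <= high }
      val total = slice.values.sum                      // assumes the slice is non-empty
      slice.map { case (rate, p) => rate -> p / total }
    }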
OK, so another question you might ask is whether these sorts of tricks work.
Does the sampling that we get from Carat match the reality of energy consumption?
And also, how much does Carat cost to run?
So if you have this going on your device and taking these
samples, are you draining your battery just as badly as
whatever the offending app was?
So to test this, we did a couple of things.
So one was that we got this Monsoon Power Monitor, which
is a neat little device that you can use.
And actually, the previous slide shows
it on the left here.
And we rigged it up to an iPhone, so that we could
measure the actual rate of energy drain.
And we also partnered with a company, a battery company
called Leyden Energy down in the South Bay here, which has
a bunch of battery testing equipment.
So they can hook up devices to their machines and run various
usage scripts.
And so for the iPhone and a Galaxy Tab 2, we did these
experiments where we would run 8 to 10-hour usage scripts,
doing things like browsing the web, email, and so on.
These were repeatable scripts.
And tested what the battery drain was, as well as looking
at Carat taking samples.
And then a third experiment where Carat was not running,
but we ran through the same usage script.
And so we'll use these to quantify how accurately Carat
is able to measure the energy consumption, and also how much
energy Carat is taking up.
Yeah?
AUDIENCE: First, can you repeat the question
[INAUDIBLE]?
Have you guys just published these stats by themselves?
This seems generally interesting to people
[INAUDIBLE].
ADAM OLINER: Your question is, have we published the stats?
The stats being which stats?
AUDIENCE: [INAUDIBLE].
ADAM OLINER: Oh, just like, what are the raw numbers of
how much energy gets consumed by various things?
AUDIENCE: [INAUDIBLE].
ADAM OLINER: So we've mostly been focusing on the app
developers at this point.
We've talked a little bit with folks at Apple.
And I'm here, hopefully, to talk to the Android people.
But it seems like they don't care
about energy or something.
I don't know.
If they care, they can talk to me.
I did my part.
So the customer service at Monsoon was fantastic.
So we had some trouble getting the iPhone
hooked up to the device.
As it turns out, Apple products are not intended to
have wires soldered onto them and whatnot.
They're not designed to make that easy.
And so they actually drew us this diagram by hand.
In the picture, I guess that's like a TI-82 running Windows,
or something like that.
But it's supposed to be an iPhone.
Anyway, so it worked.
And that was pretty exciting.
OK, so just to give a little bit of a feel for some of
these numbers, obviously there's a lot of data here and
I'm not going to talk about all of it.
But the first high-level bit is that the values that we see
on the battery indicator that's returned by the API
agree with what you see coming out of the power monitors.
So these are just numbers that, more or
less, track each other.
The second high-level bit is that you get pretty good
accuracy in terms of using the prior to compute what the
energy drain distribution is.
So this orange curve is the discharge distribution as
estimated by Carat.
And the green and black ones are the distribution as
measured by the numbers that we got out
of the power monitor.
And they don't match exactly, but the thing to recognize is
that this is from one device over a single day.
This is data from the iPhone.
And so over the course of that experiment,
Carat took nine samples.
It only measured the battery level nine times.
But using the prior, was able to get this accurate of a
distribution.
The other high-level bit here is that running with and
without Carat is basically the same amount of battery.
And just to give a little bit of quantities to these, during
the experiment we did with the Galaxy Tab, Carat only
underestimated the battery drain by 0.00015% per second.
On iOS, it was a little bit worse.
So these estimates were fairly good, given that Carat had fewer than a dozen samples compared to the thousands and thousands
that the power monitors were measuring.
And the second is that Carat is extremely low overhead.
So during the experiment where Carat was not running, in its
place we ran the standard Weather app.
And Carat uses less energy than the Weather app.
The Weather app used 3.5% more of the battery over the course of this experiment than Carat did.
So that was the ground truth experiments.
I'll talk now a little bit about the implementation,
focusing on the back end.
So as I said before, this is what the Carat infrastructure
looks like.
You have the crowd, which communicates data to a central
server, which is then sent to the back end analysis.
And mostly today, I'll be talking about this back end.
So to implement the analysis, we used a language which was
developed at the AMP Lab at Berkeley called Spark.
Spark is a cluster computing framework that uses a
structure called Resilient Distributed Datasets.
And the merit of these distributed datasets is that
they reside in memory, and they persist
across runs of jobs.
And so the idea is that rather than, like a MapReduce job
where you have to reload everything into memory, if you
want to run it again, they instead
stay resident in memory.
And the intention here is that you will run iterative jobs or
interactive workloads.
And so, this is one of the merits of the RDDs.
They also provide various other features, such as the
ability if a node crashes, to efficiently recompute the data
that was lost.
And so it's actually quite a cool language.
And we've had a lot of success with it.
And I'll show you some of the parallelization numbers in
subsequent slides.
So one of the main challenges that we faced when trying to
parallelize this was the process of converting the
samples to rates, and then the rates to distributions.
And to get the samples to rates, one of the main
challenges there is that we had this inter-sample
dependency.
Where, in order to compute a rate, we needed these
consecutive pairs.
But then the subsequent rate needs the second sample from
the first one and so on.
And so one of the things we had to do was convert
everything into these consecutive sample pairs and
use those as kind of the unit of information.
So there was some replication of data going on there, but it
removed this dependency that we had between samples, and
let us parallelize much more efficiently.
And then this is a slide that's showing a little bit
about how we do the conversion between rates and
distributions, just to give you a sense of what a workflow
might look like in Spark.
And it should look very familiar to you.
It's operations like map, and reduce, and group by.
And so this uses the same sort of language that you see in
MapReduce operations, except Spark supports general graphs
instead of just specific kinds of graphs.
And so we were able to leverage the Spark parallelism
in order to do this and use RDDs to good effect.
And I won't talk too much about the details here.
I think it's fairly self-explanatory.
And if there are questions about it, then I can field
those individually.
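As a very rough Spark-style sketch of that workflow, reusing the hypothetical Sample and Rate records from earlier (the operators are real Spark, but the pipeline is illustrative, not the actual Carat job): pairing consecutive samples per device removes the inter-sample dependency, and the rest is ordinary map and groupBy work.

    import org.apache.spark.rdd.RDD

    // Illustrative: samples -> consecutive pairs -> rates -> per-feature rate distributions.
    def analyze(samples: RDD[(String, Sample)]): RDD[(String, Seq[Double])] =   // keyed by device id
      samples
        .groupByKey()                                             // gather each device's samples
        .flatMap { case (_, ss) =>
          ss.toSeq.sortBy(_.time).sliding(2).collect {            // consecutive pairs as the unit of work
            case Seq(a, b) if b.battery < a.battery =>
              Rate((a.battery - b.battery) / (b.time - a.time),
                   a.features intersect b.features)
          }
        }
        .flatMap(r => r.features.map(f => (f, r.percentPerSec)))  // one record per (feature, rate)
        .groupByKey()                                             // feature -> all observed rates
        .mapValues(_.toSeq)                                       // the conditional drain distribution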
So Carat, as you might imagine, is very low traffic.
So any individual device is not communicating very much
information to the servers.
On average, it's way, way less than a byte
per client per second.
And this means that with relatively few servers, we can
handle all of the both incoming and outgoing traffic
that all of our users generate.
This is currently being handled by 5 AWS instances
running with a load balancer in front.
But really, this is overkill.
So the server itself is not a bottleneck for us.
I'm just sort of showing you this to say, don't worry too
much about the central server.
The main issue is dealing with the analysis.
So parallelizing the analysis was absolutely essential.
We had done a sort of naive parallelization of it when we started out with our initial sort of lab distribution of a
few dozen users.
But as I'll describe later, this fairly quickly exploded and we had to do a proper parallelization of it in order
to make it scale.
So just to compare here, this is an optimized serial
implementation.
And on the x-axis, you have the number of samples.
So the red is the optimized serial implementation.
And then the green one is when you actually
parallelized this in Spark.
So essentially, by the time we have the number of users that
we currently have, which is about 340,000, the analysis
would have taken more than a day to run if we had stuck
with a sort of serial implementation.
In Spark, we can do the whole thing in about 45
seconds from scratch.
So I'm going to spend the rest of the talk describing the
deployment that we have, and
explaining some of the results.
Just giving some examples of some of the anomalies that
we've seen and so on.
Yeah, Phil.
PHIL: So as a user, what is the significance of the
[INAUDIBLE]?
Do I have to go to the server and wait [INAUDIBLE]?
ADAM OLINER: Yeah.
So we're changing how this works.
So I'll explain the process of how from a user's perspective
they interact with the Carat app.
So currently the way it works is you start up Carat and it
basically just doesn't have any information for you.
It will start taking samples and sending stuff to the
server, but we'll just say that it has no results yet.
And this was true initially just because we had so little
data that we couldn't bootstrap it on anything.
But now that we actually have users, the intention is that the app can immediately report what apps a new user is
running, and we can give back to them information about the hogs.
So at the very least, we can populate the hogs list when
they first open the app.
And this is something that we intend to do soon.
The bugs, on the other hand, require us to observe that
particular device over time.
So the speed of this is not bounded by the analysis speed
in terms of, how quickly can we tell them about bugs?
It's really just a function of, how quickly can we get
enough data from them to have enough confidence that we can
say something about which of these apps are misbehaving?
And so that's really the limiting factor.
The reason that having an analysis that runs quickly is
important is that, first of all, it wasn't like 45 seconds
versus a minute and a half, or something, that was the issue.
It was that it was taking more than a day for people to start
getting any sort of results.
And that was frustrating them, and us.
Another thing that's nice about having it be only 45
seconds is that these diagnosis trees that we build,
we can go much deeper and start to look at things like
combinations of features as opposed to just individual
features, and splitting based on those.
So all of these basically give us more flexibility in terms
of what sort of situations we can look at.
All right, so initially, back in I guess late January, early February, we had a little iOS implementation and we
distributed it to a small number of people.
You're allowed to basically get 100 people to sign up for this.
This is a cap that Apple imposed.
Of those people, 75 installed.
The 25 who signed up and then did not install,
unfortunately, we couldn't reuse their spots.
You can't recycle unused spots for these sort of beta
distributions, which is frustrating.
But so this initial deployment was 75 people.
And from these 75 people, over a couple of weeks we collected
about 10,000 samples.
So the first thing to know here is this is a really small
initial deployment.
There's very little data.
And so despite that fact, we were able to find apps that exhibited energy anomalies in general, and 35 in particular that exhibited energy bugs.
And these were in popular apps.
So things like Facebook, and Kindle, and Flipboard, which
is a digital magazine app.
And we were able to corroborate these things with
forum posts, and news articles.
And even, in some cases, the implication of some of these forum posts was that there were particular features that
were triggering these misbehaviors.
So interactions with a particular OS version, or
issues when they didn't have Wi-Fi access, and so on.
And we were able to actually corroborate those with the
data that we had.
So despite this relatively small deployment, we were
already starting to find some interesting energy anomalies.
So this was a very encouraging initial result.
We also tried doing some synthetic bug injection.
So we took the Wikipedia mobile app and we wrote in
some behaviors, some misbehavior that we could
manually trigger that would cause the app to
use a lot of energy.
Basically, abuse a resource like the GPS or
the CPU, and so on.
We installed this on one of the devices in the deployment and then tested whether, when we triggered that misbehavior,
Carat came back and said, it looks like Wikipedia running on that particular device is
exhibiting an energy bug.
And in all three cases, it did so.
And in no other cases on any of the devices running
Wikipedia mobile did it say that it was anomalous.
So essentially, it did the right thing in this injection
experiment.
So both of these initial results were pretty promising.
And so we developed an Android version and submitted both of
these things to the respective app stores.
And then, after a bit of a delay, finally in mid-June,
they both got in to their respective app stores.
And then, we got featured on TechCrunch, which kind of took
us by surprise.
And it was very exciting, so we were pretty thrilled that
we might go from like 75 users to some larger number.
Then Lifehacker picked up the story from TechCrunch, as did
dozens of other news sources.
So within 24 hours, there were at least dozens of articles.
And we rocketed to more than 100,000 users
in less than a day.
So I basically woke up, and then checked Flurry and said, oh, no, that's a lot.
And then, within the 24 hours after that, we
went to 200,000 users.
And so the rest is history.
So today, we have about 340,000 devices.
Most of those are iOS, but Android is gaining quickly.
I think mostly the reason that iOS started out ahead is that
the TechCrunch readership, I think, is a little bit biased
toward Apple devices.
I can't back that up, that's just my
impression of the crowd.
And so far, at last count, we have something
like 20 million samples.
So this is a far cry from the 10,000 that we used to have.
And obviously, there's a lot more that we're
finding in this data.
So to give you a sense of what these numbers look like, what
sort of energy anomalies we're seeing, about 9.4% of the
reported apps qualify as energy hogs.
This was about 11,000 apps.
These include the ones that you might expect, like Pandora
and Skype, things that use a lot of resources.
But what's surprising is that some of the energy hogs are
not apps that you would expect to be energy hogs.
So, for example, there was a family of basically airline search-related apps, where you could check whether there were flights available
and so on, that were using just far more energy than they ought to be using compared with other similar apps.
And also, we found lots of instances of energy bugs.
So about 5.3% of the instances of apps running on devices qualified as buggy instances.
These include things like Kindle, and
Facebook, and YouTube.
And I'll describe a couple of examples of energy bugs in
subsequent slides.
And obviously, I can't even scratch the surface on some of
the interesting stuff here.
So I'm just going to give a couple of examples.
If you're interested in seeing examples of energy hogs and
energy bugs, I encourage you to run Carat on your device.
You'll probably find some.
So one example of an energy bug that we found was Kindle
running on iOS.
I particularly like this example.
I think it was really interesting.
So this was reported as a bug on 3.9% of the clients.
So a relatively small fraction of the people running the
Kindle app were seeing way more energy drain than the
other users.
And this has already been sort of discovered by frustrated
users who were running the Kindle app.
And they were complaining on the forum.
And one of the theories that was pushed forward was that it
was related to this WhisperSync protocol.
So if you've used one of these Kindle apps before, as you
know, it synchronizes your bookmarks and the books that
you've purchased-- obviously, annotations and so on--
between your various Kindle devices.
So it turned out that there was a bug in this
implementation, a bug in the protocol.
Such that when you did not have access to Wi-Fi, the
Kindle app would use significantly more energy than
if you did have access to Wi-Fi.
And I thought this was interesting because it's kind
of counterintuitive.
I would think that you would use more energy when you're
using an energy-hungry resource, like the Wi-Fi, than
when you're not.
But it turned out to be the opposite.
And it was a fairly significant difference.
So if you just turned on Wi-Fi when you were using the Kindle
app, you'd get another 36 minutes of battery life.
And this is a statement that we can say with 95% confidence.
That is, for a typical buggy user of the Kindle app, that's what they will see.
And the diagnosis tree that we build looks
something like this.
I'm just showing a part of this tree to give you a sense
of the sort of information that we can compute.
So this is saying the battery life without Kindle versus
running with Kindle.
One thing that's interesting is that actually you get
better battery life running Kindle than when running
standard other apps.
So when I say without Kindle, that means running an average
of all the other apps that someone might run.
So in other words, on the whole, Kindle is a very
energy-efficient app.
[INAUDIBLE] displaying something on the screen.
It's not like you're playing a game that's using
a lot of the CPU.
It's not constantly using the network and so on.
So it's actually not an energy hog, which is
important to recognize.
So it's actually a fairly
energy-efficient app in general.
But it turns out that during this WhisperSync process, it
uses a lot of energy.
So you can see this by looking down the next level on the
tree, which describes network connectivity.
So this "with Kindle" figure of 8.4 hours is the sort of agglomeration of the three distributions below it: when
the network's off versus when you have these two different types of connectivity.
So what this is saying is that when the network's off and
it's not trying to do this synchronization, you get
pretty good battery life.
It's using basically no energy.
But when it's actually doing the sync, it's
energy-inefficient.
And in particular, it's less energy-efficient using 3G than
when running Wi-Fi.
And this is the sort of surprising aspect of this
diagnosis, to me, at least.
So before I move on, is there a question about this?
AUDIENCE: The bug on the previous slide that you were
talking about, wouldn't that-- sorry.
How is that reflective in this [INAUDIBLE]?
ADAM OLINER: So the difference between running with the 3G
versus running with the Wi-Fi.
So essentially, if you're in the 3G node and you turn on
Wi-Fi, then you move over to this Wi-Fi node.
And so that's the difference in energy that you would
expect to see during the synchronization process.
AUDIENCE: That makes sense.
You expect to get more battery life Wi-Fi than on 3G.
ADAM OLINER: Do you?
AUDIENCE: 3G radio take more power.
ADAM OLINER: But your 3G radio, it's on anyway.
AUDIENCE: I might not have a good [INAUDIBLE].
ADAM OLINER: Yeah.
AUDIENCE: It's not surprising me.
ADAM OLINER: Well, some of these resources are
complicated.
So like the radio, for instance, has tail energy that
you need to consider.
So if something else uses the radio, then you pay for that
tail energy for some period of time following the actual use
of the resource and so on.
So all of these computations that we're doing are just
talking about averages.
So it's very hard to say for a specific device what's likely
to be the energy consumption.
Because there are lots of kind of chaotic things going on.
But as a sort of general statistical statement, users
who have Wi-Fi access are getting better battery life.
Maybe I'm naive to be surprised by this.
AUDIENCE: That's exactly what they said.
ADAM OLINER: OK.
Well then, there you go.
So turn on Wi-Fi, kids, I guess is the moral.
OK, so I'll move on.
So another example is the Twitter app on Android.
So most of you are probably familiar with Twitter.
This was reported as a bug on about 15% of the clients
running it.
So this is a relatively large fraction, I think.
If you look at the diagnosis tree, which we sometimes call
an MCAD for reasons that I won't go into, it implicates
the operating system version.
So in particular, what it suggested was that users
running Ice Cream Sandwich 4.0.4 were seeing 94 minutes
more battery life than Twitter users who were running any
other version of the operating system.
So there was something going on with Twitter interactions
with these other versions of the operating system that was
causing a significant drain on the battery life.
So this was a serious problem.
It also turns out that Wi-Fi helps.
Maybe.
This is now becoming like a thing.
Get on the Wi-Fi network ASAP.
So this is interesting.
So this is another actionable thing that you can do.
So in addition to the recommendations of killing an
app, or restarting an app, Carat sometimes will tell you,
upgrade your operating system.
And currently, that recommendation, at least on
the UI side, is only coming about if you see that users of
a particular version of the OS, as a whole, are getting
much better battery life.
But on the back end, there are all sorts of things that we're
computing that would allow us to say, you're a Twitter user.
And so if you're going to use Twitter a lot, you should
definitely upgrade because it seems to make a big
difference.
So we're computing all of this on the back end, we just
haven't been focusing on the UI recently.
Such is research, I suppose.
OK, is there a question about this story before I move on?
OK.
Cool.
So in aggregate, what sort of effect does Carat seem to have
on battery life?
So I'll describe this plot, and then there are a bunch of caveats in trying to talk about this sort of data that
I'll describe as well.
So the x-axis here is the days since the first report that
Carat gives the user.
So user uses it for a while, and then at some point, Carat
thinks that it has enough information about that user to
report back and say, here are some energy hogs and some
energy bugs on your device.
So 0 here is the day that Carat has made that decision
and sent out that information.
It does not necessarily reflect anything about when
the user first read those pieces of advice, when the
user first did anything about them, and so on.
All of that is very hard for us to measure from
our side of the fence.
But in aggregate what it says is that after 10 days, the
average Carat user sees about 10% more battery life.
And then, after 90 days, it suddenly [INAUDIBLE]
up to about 30%.
So this is a significant improvement in battery life.
There are obviously a number of caveats with this.
So the first is that we have a very biased sample of the
population.
This is users who sought out and installed, and then
continued to run Carat for 90 days or more.
So frequently, these users had pretty bad battery life to
begin with.
So the baseline may have been quite low.
The second thing is that users typically will not do a nice
controlled experiment for us where they will change no
other behaviors except running Carat, which is unfortunate,
but that's what we got.
So there may have been other things that these users were doing during this period of time that caused their
battery life to improve.
So one of the things they could even have done is actually replace the battery with a new one at some point
during this process.
So this is aggregated over hundreds of thousands of
users, but there are various confounding factors that we
just don't have the data to disambiguate.
All we can say is that in a general sense, it seems like users of Carat are improving their battery life by double
digits, even after relatively short amounts of time.
Yeah?
AUDIENCE: How wide is the spread variation?
ADAM OLINER: I don't have those numbers for you,
unfortunately.
We did compute them, but I just don't remember them.
AUDIENCE: [INAUDIBLE]?
ADAM OLINER: The 95 percentile band.
So it starts out and it's pretty thin.
It gets a little bit bigger as you go out in the graph
because there are fewer users.
That's sort of why you see this jaggedness.
There are fewer users who have been running it for that long,
just because we haven't been around for that long.
I mean, 90 days is almost the entire
lifetime of the project.
OK, so one thing I talked about before is this
convergence of the error bars, the increase in our confidence
as we get more data.
And so this is an example of this happening.
This is an aggregate of the error bars and the expected
value estimates for several hundred of these energy hogs.
So on the x-axis, you have the number of samples.
And so again, we have like 20 million samples, or something.
So these are really small numbers that we're looking at
here in terms of how many samples.
And on the y-axis, you're looking at the relative
expected value, which is--
so 0 would be a perfect estimate of the true expected
value based on all of the data as opposed to just this
sampling of it.
And then the error upper bound and lower bound are these
dashed dotted lines.
Essentially what you're supposed to see here is that
those two converge very quickly, so your error
decreases rapidly, as does the accuracy of your estimate of
the expected value.
This is very helpful to us.
I guess this is the point.
So this is a property that we relied on and are glad to see
in practice.
OK, and then the last thing that I'll mention in terms of
the data here is the prediction accuracy.
And this graph requires a little bit of explanation
because it's actually very hard to
visualize this statement.
Essentially, what's happening is that Carat is saying, if you kill this app, you'll get, let's say, an hour more
battery life, plus or minus 10 minutes with 95% confidence.
Let's say that a user sees that kind of a statement.
So what we'd like to do is assess the accuracy of that
statement over a large number of users, over a large number
of these recommendations.
So if you had perfect prediction, what that would essentially mean is that--
an hour plus or minus 10 means that if a user went from using the app all
the time to never using it, they should see an hour, plus
or minus 10 minutes, more battery life.
We treat this as a linear relationship for the purposes
of assessing its accuracy.
And so if they go from using it all the time to just using
it, say, 50% of the time, then they should see
half of that benefit.
It should be 30 minutes plus or minus instead of an hour plus or minus.
And so, this is essentially describing--
this hour number is essentially describing the
slope of a line.
And so we can characterize the slope that users see in
practice when running these apps.
So we can say, the user running this particular app,
when he goes from using it 60% of the time to 20% of the
time, how does his battery life change?
And so you can do this over a large number of users for a
large number of user recommendations.
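A small sketch of that linear scaling, with invented names: the predicted benefit is treated as a slope over the change in usage fraction, and a recommendation counts as within bounds when the observed change falls inside the scaled error.

    // Illustrative: scale the predicted benefit by how much the user actually changed
    // their usage, then check the observation against the scaled error bounds.
    // predictedGainMin: e.g. 60 +/- 10 minutes for going from always running to never running.
    def withinBounds(predictedGainMin: Double, errMin: Double,
                     usageBefore: Double, usageAfter: Double,     // fractions of time in [0, 1]
                     observedGainMin: Double): Boolean = {
      val delta    = usageBefore - usageAfter                     // e.g. 1.0 -> 0.5 gives 0.5
      val expected = predictedGainMin * delta                     // linear scaling of the prediction
      math.abs(observedGainMin - expected) <= errMin * delta
    }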
So the green line here would be perfect prediction.
That would be the slope if we get it exactly right.
And each of the other lines is representing a particular
recommendation for a particular user.
So one of these lines might be Dan's phone running Facebook.
And we told him 45 minutes, so it should be right on that green line if it were perfect prediction.
And the closer these lines are to the green line, the closer
we were to being correct.
So the gray lines here are ones that were within the 95%
confidence bounds for that particular recommendation.
So if we told him an hour plus or minus 10 and it was an hour
and 7 minutes, then that would be within the confidence
interval, within the error bounds, and so that would show
up as a gray line here.
And the orange lines are ones that fell outside of the
confidence interval.
So we measured this for a large number of these
recommendations and for the 95% confidence bounds, 95.4%
were within those bounds.
So this is exactly what one would hope.
Just under the wire there, right?
So 95% would have been the lowest that we could go.
So this was a very good result for us.
We were pretty excited about it.
Basically what it means is that we're able to not only
say with pretty good accuracy what sort of battery life
improvements users should expect to see, but even the
error bounds that we were computing seemed to encompass
the right fraction of the users.
Yeah?
AUDIENCE: [INAUDIBLE].
Is that just a one-to-one thing, that slope of the line?
ADAM OLINER: It's a relative slope.
I wanted to get a lot of information into one graph.
So the idea here is--
AUDIENCE: [INAUDIBLE].
It's not like the other slopes are different slopes.
They're all--
ADAM OLINER: They are, actually.
AUDIENCE: They are different slopes?
ADAM OLINER: Yeah.
They are slightly different slopes.
It's a difficult graph to describe.
I'm trying to figure out better ways of visualizing
this sort of information.
But the high bit here is that bullet point, which is that
for a 95% confidence bound, more than that
are within that error.
So it was good.
OK, so I'm just going to wrap up here by talking a little
bit about what we're working on now.
So the first is we're still working to improve the error
and confidence bounds that we're giving to people.
So we'd like to be able to give more complicated
diagnoses and still be able to attach these sort of
confidence intervals.
So telling a particular user that when you're running
Twitter, you should turn on Wi-Fi or something like that.
Or upgrade to Ice Cream Sandwich 4.0.4.
Another thing that we're looking to do is build an API
for developers.
So if we, for example, see that Twitter is exhibiting an
energy anomaly on a particular subset of the devices, we
don't know, for example, whether that subset of devices
just has a particular setting inside of the app, or if the
user is just interacting with that app in a particular way.
That's not something that's visible to Carat.
So one of the ways that we can get around this is if the
developer of that app adds a little bit of instrumentation,
something that you see in, like Flurry, or similar sorts
of libraries.
And so this would let us know, what are the settings of that
app and so on, and give us additional features that we
could mine for relationships between energy consumption and
those features.
So this is part of a sort of vision that we have for doing
this sort of collaborative debugging, just deploy to the
crowd and debug in the cloud.
There's nothing specific to the analysis about energy.
So that semantics is entirely invisible to everything that the back end does.
So there's no reason that we can't also use this for
performance debugging, for example.
Or really, the consumption of any of these resources.
And more broadly, we're starting to look at this as
statistics as a service.
So we've been talking about applications of
this to other domains.
For example, one of my favorites is the body metrics.
So a lot of people have like sleep monitors, and running
monitors, and so on.
And you could ask similar questions about
those sorts of data.
So compared with other people who have running habits like I
do, how do my sleeping habits compare?
Why am I not getting enough sleep?
For the people who are not getting enough sleep, what are the dimensions in which they differ from the people
who do, and so on.
So anyway, it seems like there are a lot of interesting
questions that you can ask of the kinds of--
of the form of the questions that Carat asks.
And so we're starting to build up sort of statistical
service, so that you can start to answer those questions.
If you want to download Carat, either for iOS or Android,
there's a link at that URL.
The client code for both platforms is on GitHub.
So feel free to fix any bugs that you find and help us
improve the app.
We certainly are shorthanded, so we appreciate the help.
So go download it.
We're keen for feedback.
And we'd love to talk with some folks here about some
more ideas.
Anyway, that's all.
Questions?
Yeah.
AUDIENCE: Are the decision trees that you showed
automatically generated?
What do you use to [INAUDIBLE]?
ADAM OLINER: They are.
So the question was, how do we build those decision trees?
What's the algorithm for exploring the feature space?
And that's a rapidly changing question.
So currently, it's fairly brute force.
The feature space that we're considering is a little bit
constrained, so we're only looking, for example, at Wi-Fi
access, the device model, operating system version, and
a couple of other things.
So doing a brute force
exploration is still tractable.
In the longer term, we're obviously looking at ways of
prioritizing how to explore the space, et cetera.
And so I'd be happy to talk with you about that, actually,
if you have ideas.
Yeah?
AUDIENCE: I have a contentious idea for
getting more blog attention.
[INAUDIBLE].
ADAM OLINER: So the question was, could we get more
attention by shaming people into being aware of various
energy problems and so on?
Yes, we could.
So the issue here is that it's basically three people working
on it, one of whom is international.
And none of whom, as a primary job, is trying to build a
product or garner attention and so on.
And so there are lots of cool things that we have in our
heads that we would love to be doing, we just don't currently
have the people to do it.
So that's a problem that we're trying to solve, but--
yeah?
AUDIENCE: And you can only get enough data at this point,
too, to [INAUDIBLE].
ADAM OLINER: Yeah, we have more data than I ever expected
us to have.
So yeah, it's really exciting, actually.
All right, cool.
Thanks, guys.