PHIL: I'll introduce Adam Oliner, who got his PhD last
year at Stanford, and he's going to talk about some
research that he's been doing as a post-doc at UC-Berkeley.
ADAM OLINER: All right, thanks, Phil.
So this is joint work with Anand and Ion at the AMP Lab
at Berkeley, and a couple of people at the University of
Helsinki, Eemil and Sasu.
So mobile is a hot topic as you know.
Sometimes this is literally true.
This is an article that someone--
well, it's a letter that someone sent into Lifehacker
complaining that a phone, which they believed to be
sitting idle in their pocket, was actually getting hot.
So it was clearly doing something even when it wasn't
supposed to be.
And this user is asking questions like, how do I make
this stop and why is it doing this?
And so, anyone who has a smartphone has probably had a
similar experience.
So they start their day and the phone is fully charged.
And maybe they don't do any work for a while.
It sits idle for a bit.
Maybe they actually get something done.
It sits idle.
And at some point, they take it out of their pocket and it
complains that the battery's been depleted.
This, I think, is a pretty much universal experience for
people who have these kinds of devices.
And for the user, there are I think three primary questions
that they'd like to answer.
First is why the battery is draining.
Is it something that they did?
Is it an app that they're running?
And then second question, one which Carat is uniquely
positioned to answer as you'll see over the course of the
talk, is whether or not that drain is normal.
So if the user does the same thing tomorrow, should they
expect to see similar battery drain?
If another person who has a similar phone does the same
behavior, should they expect to see similar battery
drain, and so on.
And then finally, what can the user do about it?
Is there a behavior that they can change or a setting that
they can change to improve their battery life?
So today I'll be talking about a tool that we built called
Carat, and a method for analyzing data that enables us
to answer those three questions that we were talking
about in the previous slide.
I'll describe how we go from the state samples that we take on individual devices to aggregating them and using them to do energy diagnosis
for those same users.
I'll talk about some of the ways we deal with uncertainty
in practice.
So as you measure things on devices, there are lots of
sources of uncertainty and imprecision.
And I'll describe some of the ways that we deal with those.
I'll talk about our implementation, mostly on the
analysis side.
And then finally, describe our deployment
and some of our results.
OK, so just to give you a little bit of background,
prior approaches had been pretty much strictly ad hoc.
So there was some work out of Purdue that looked for specific kinds of misbehavior.
So, in particular, on Android they were looking at something
called a no-sleep bug.
This was a violation of a locking discipline that would
prevent the screen from going to sleep.
And so if your energy problem was not related to a no-sleep
bug, that approach would not help you to figure out what
was going on.
Many of the previous
approaches were also intrusive.
They would require you to instrument the operating
system, or have access to the source code of a particular
app that you were trying to debug.
And this is frequently prohibitive, especially for
lay people.
And then finally, you can find dozens and dozens of apps on
pretty much every app store that claim to help you with
your battery life.
And almost universally, these are generic pieces of advice.
They'll tell you, kill all of your background apps or dim
the screen.
Basically, the advice they give is just don't use your phone as much, which is
also not very helpful.
It's not actually telling you what was going on and whether
or not that was surprising behavior.
So our approach looks like this.
This is a kind of overview of the infrastructure.
We collect data.
Rather than from a single device, we collect it from the
crowd, aggregate that in the cloud, and do a statistical
analysis on it there.
And then we return to the user, for their particular device and their behaviors, information about what seems to be
causing the energy problems, and whether or not that's normal.
So as far as we're aware, this is the first collaborative
approach for diagnosing these sorts of energy problems.
So we built Carat as a mobile app for both iOS and Android.
And it provides personalized energy debugging.
So what it will tell you is what's misbehaving, whether that's normal, and what you can do about it.
And those are things we've mentioned before.
And furthermore, Carat will also try to quantify how much
it will help.
And I'll describe how we do that.
So the design goal of this project was to take a
particular design point in the space.
And so we chose the point that was the most invasive
method that works on both iOS and Android.
So there are things that you can do that are specific to
either platform, but we wanted to do, what is the sort of
most you can do that works on both of these?
And so the primary constraint there was, what was eligible
for Apple's App Store?
That was sort of the upper bound on what we
were able to do.
And so the goal here is to just investigate how far we
can take diagnosis given only that sort of information.
So there are lots of questions related to, well, what if you
measured this?
What if you measured that?
And we're looking at those, too, but that's not the goal
of this particular stage of the project.
And I'll be talking today just about Carat as a method that
works on both of these platforms.
So this is what some of the screens of Carat look like.
On the left and on the center, you see what's called the
Action List for iOS and Android.
And what this tells you is, based on the apps that you're
running, which ones should you kill or attempt to restart, so
that you can improve your battery life.
So if it says, say kill Pandora, what it's suggesting
is that Pandora is running on your device.
And that it seems to be consuming a lot of energy.
And in particular, it's estimating that you'd get
about an hour and a half more battery life if you killed it.
Which is significant.
On the right side, you see one of the other screens, which is
the Device Screen.
And this gives you a little bit of information about how
long you should expect your battery life to last as you
currently use it.
And finally, something called a J-Score.
The J-Score tells you your sort of percentile battery
life relative to other similar users.
So this particular screen is from a device that was in the
64th percentile.
Meaning 64% of the other devices got worse battery life
than this particular device.
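To make the idea concrete, here is a minimal sketch in Scala of a percentile-style score; the names and the exact statistic are illustrative, not the actual Carat computation.

    // Illustrative only: a J-Score-like percentile rank of expected battery life.
    // expectedLifeHours: this device's expected battery life; peers: comparable devices.
    def jScore(expectedLifeHours: Double, peers: Seq[Double]): Int = {
      val worse = peers.count(_ < expectedLifeHours)       // peers with shorter battery life
      math.round(100.0 * worse / peers.size).toInt         // e.g. 64 means better than 64% of peers
    }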
So sometimes people would complain that they were
experiencing battery problems, but had a really good J-Score.
And this essentially meant that they thought they had bad
battery life, but relative to other people, they were
actually doing pretty well.
And vice-versa.
There are people who have extremely low J-Scores and
seem really happy with their battery life, which I guess is
the merit of low expectations.
There were a lot of privacy concerns when we were first
developing the app.
So people expressed a little bit of wariness about, well,
what sort of information are you going to be reporting?
There have been previous incidents about things like
Carrier IQ and so on that reported rather large amounts
of personal information.
And so we went to great pains to make sure that we were not
collecting anything that was personally identifying, or
otherwise endangering people's privacy.
In the end, it turned out that people say that they care
about privacy.
But in practice, it doesn't seem that they do.
And actually, we would've, in retrospect, liked to have
collected more personally identifying information so
that we could help individual users with support.
So we would get emails from people saying, hey, I want to
understand this particular set of recommendations, but
because we collected nothing that ties that person to the
data that we've collected from them, we have no way of
looking up their data and telling them anything.
AUDIENCE: So what's like an easy thing to collect that?
Give me the benefits.
ADAM OLINER: You mean, what is an example of a personal
identifying piece of information?
I mean, we could literally just say, hey, could you type
in your email address.
And then we can tie that to the records and then be done.
And we'll probably do that in the later version, just
because it seems like almost everyone would be willing to
provide that information.
And it would be helpful to us.
At any rate, so this was something that we were
initially concerned about, but it turns out that we probably
should have been more liberal about what we collect.
OK, so now, that's a kind of overview of
what Carat looks like.
And I'll start to dive in here and describe the process of
how we go from the data that we collect on the device to
the diagnosis.
So the way Carat works is that periodically, it will wake up
and get some time to write down information about the
state of the device.
So in particular, it'll write down the time stamp.
It'll write down the battery level.
And then a feature vector that includes things like what apps
are running, does the device have Wi-Fi access, what
version of the operating system is it
running, and so on.
And it doesn't just collect one sample, it collects
samples over time.
So in this case, you see that the battery in the second
column there is depleting over time, certain apps are being
closed or opened, Wi-Fi access is intermittent, and so on.
So this tells us what the battery level is
instantaneously, but we're interested in
rates of battery drain.
And so what we'll do is we'll look at consecutive pairs of
discharging samples and convert this into a rate.
And so we do this by looking at the difference in time, the
difference in the battery level.
This gives us a discharge rate in percent per second.
Then we take the two feature vectors and combine these two
to get a feature vector for this discharge rate.
And so we can condition the discharge rate on some set of
features like, what does the discharge rate look like when
the user has Wi-Fi access, or is running an iPhone 4, or
something like that.
So from these, we can build probability distributions,
these conditional distributions of energy drain.
So just to be clear, this is the probability distribution.
And to be concrete about it, let's say that this feature
vector is devices that are running Facebook.
So this is the distribution of energy drain when users are
running the Facebook app.
So one way that we can characterize this distribution
is by comparing it with other distributions.
So in particular, we might be interested in the distribution
when the users are not running Facebook.
So if you see a pattern like this, where the average energy drain when people are not running Facebook is
significantly lower than when they are running Facebook, we describe this as an energy hog.
This is a particular type of energy anomaly
that we look for.
And I'll explain what I mean by significantly more energy a
few slides from now.
OK, so this is a pretty straightforward
characterization.
This is just one pair of distributions that we can compare.
Another thing we might be interested in is Facebook
running on a particular device.
So let's say we're looking at Facebook
running on Ion's phone.
Now, if we only had one device, there's no way for us
to characterize whether or not this distribution
that we see is normal.
So this is a critical difference between Carat and
other approaches.
So if we have the crowd, however, we can say something
about, well, what about Facebook running on other
people's devices?
Is it the case that running on Ion's device, it seems to be
using way more energy than on other devices?
So if that's the case, then we describe this
as an energy bug.
So this is just a particular terminology that we use to
describe comparisons between an app running on one device
versus an app running on other devices.
So if it's using significantly more energy, then we call that
an energy bug.
So now, we'd like to do more than just compare these two
distributions.
We'd also like to say something about how confident
we are in this comparison.
So to do this, rather than comparing just the original distributions and looking at their expected values, which
is what we were doing before, just looking at the distance between these two, we're instead going to look at the distribution of these
expected values.
Because these are computed from a sampling of some true
distribution.
And it turns out that these are normally distributed,
which is nice.
And so instead of comparing just the expected values, we
can now talk about the relationship between these two
distributions.
And, in particular, we can quantify our error and
confidence.
So we have both an expected value, and given some
confidence interval we can say something about that expected
value being plus or minus some value.
So we can say, the expected value is x plus or minus some
error e with 50% confidence.
And if we'd like to be more confident, we can increase the
size of those error bars to some larger E and say, with
95% confidence, the true expected value is somewhere in
this range.
And there are two factors mathematically that contribute
to this confidence interval.
The first is the variance of the original distribution.
So the more variance you see in energy drain of a
particular app, let's say you watch Facebook running on a
particular device and sometimes it
uses a lot of energy.
Sometimes it uses very little, and so on.
You'll end up with a very wide spread of this distribution.
It will have high variance, and that will increase the
size of these error bars.
And similarly, you have the size of the crowd.
So basically, how much data you have.
So if you decrease the amount of variance and increase the
size of the crowd, if you do either of these two things,
you expect your error bars to decrease.
Basically, increasing your confidence.
Because you either have more data, or the data that you do
have is more consistent about what it says
about the energy drain.
And the opposite is also true.
If you have higher variance or less data, then your error bars get bigger.
So when we say that something is significant, we're talking about whether or not one distribution is far enough
away from the other distribution in terms of the expected value and the error bars.
That is, whether there actually is this sort of separation between these two error ranges.
So if they're actually separated like this, then we
can say with, let's say, 95% confidence, that one is, in
fact, on average, experiencing greater energy
drain than the other.
We call that significant.
And for the rest of this talk, it's implicitly 95%
confidence.
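To make that concrete, here is a minimal sketch of the kind of computation involved, using a standard normal-approximation confidence interval on the mean; the details are illustrative and not necessarily the exact statistics Carat uses.

    // Illustrative: a 95% confidence interval on the mean drain rate, and the
    // separation test the talk calls "significant".
    case class Interval(mean: Double, err: Double) {   // mean +/- err
      def lo = mean - err
      def hi = mean + err
    }

    def meanInterval(rates: Seq[Double], z: Double = 1.96): Interval = {
      val n    = rates.size                            // assumes n >= 2
      val mean = rates.sum / n
      val varn = rates.map(r => (r - mean) * (r - mean)).sum / (n - 1)
      Interval(mean, z * math.sqrt(varn / n))          // shrinks with more data, grows with variance
    }

    // "Significant": the two error ranges do not overlap.
    def significantlyGreater(a: Interval, b: Interval): Boolean = a.lo > b.hi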
So one of the things we'd like to do is not just quantify our error and confidence, but also increase the confidence that
we have in the recommendations that we're giving.
And one of the ways that we'll do this is by doing
classification.
So we can take a distribution and we can split it up into
two other distributions conditioned on some feature.
So we can say, this is a distribution.
Let's say the original
distribution was running Facebook.
We can split it up and say, let's look at the energy drain
when they had Wi-Fi versus when they did not have Wi-Fi.
So by doing one of these splits, we're decreasing the
amount of data that we have in each of these leaf
distributions.
And as I just said, this decreases our confidence in
the recommendation that we're giving.
But it may also be the case that performing one of these
splits decreases the variance of these distributions.
And it may do so significantly enough that actually, our
confidence, our error bars, are smaller on these split
distributions than on the original one.
So if we see this sort of behavior, then we'll perform
one of the splits.
And you can build a diagnosis tree like this.
So you can split on various features, sort of digging down
until you actually find the conditions under which energy
drain is significantly higher or significantly
lower and so on.
And these diagnosis trees allow us to compute diagnoses of the following form.
So you could say something like, killing some app A will give you x plus or minus some error e minutes of battery
life with 95% confidence, and we can also supply alternative diagnoses.
So just saying, well, you could get a similar effect if
you upgraded the OS to some other version v.
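One plausible reading of that split rule, sketched in Scala and reusing the hypothetical Rate and meanInterval definitions above (again illustrative, not the real algorithm): keep a split only if both conditioned distributions end up with tighter error bars than the parent.

    // Illustrative: split a distribution on a feature only if it tightens the error bars.
    def trySplit(rates: Seq[Rate], feature: String): Option[(Interval, Interval)] = {
      val (withF, withoutF) = rates.partition(_.features contains feature)
      if (withF.isEmpty || withoutF.isEmpty) None
      else {
        val parent = meanInterval(rates.map(_.percentPerSec))
        val a      = meanInterval(withF.map(_.percentPerSec))
        val b      = meanInterval(withoutF.map(_.percentPerSec))
        // keep the split only when the drop in variance outweighs having less data per leaf
        if (a.err < parent.err && b.err < parent.err) Some((a, b)) else None
      }
    }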
So I just want to pause right here.
Just, are there questions about this part of the talk?
Yeah.
AUDIENCE: What's your standard delta t?
ADAM OLINER: You mean the sampling interval?
AUDIENCE: [INAUDIBLE].
ADAM OLINER: Yeah, so this depends on the operating system.
So on iOS, for example, you can't schedule regular
intervals to wake up.
So all we can do is subscribe to various notifications.
And then we're at the whim of when those
notifications get triggered.
These are things like when the battery level hits 5-percent marks, when someone plugs or unplugs the
device, and so on.
So there's more flexibility on Android in terms of when you
can schedule these sorts of things.
So it depends, basically, is the answer.
On Android, it's still pretty infrequent.
So once every 5 minutes or so.
And I'll talk about what sort of overhead this incurs a
little bit later.
AUDIENCE: I can start Facebook [INAUDIBLE] and then it
wouldn't get sampled?
ADAM OLINER: That's right.
That's right.
So the granularity on any individual device is very low.
And so one of the reasons that Carat is able to work at all
is because we do this on a large number of devices.
OK, so this segues nicely into the question of dealing with
uncertainty.
So there are lots of sources of uncertainty, and I'll talk
about just a couple of those in this talk.
So for one, I'll talk about the case of iOS and measuring
the battery level.
So the battery API on iOS is kind of interesting.
So there's a particular type of notification called battery
level changed.
And when that triggers, the battery level that it tells you is roughly accurate.
So if it says 85%, then the true battery level is roughly
around 85%.
But for all other notifications, under all other circumstances where you ask the battery API what the battery level is,
the value it returns can be as much as 5 points above the true value.
So it may say 85% and it could be anywhere
between 80% and 85%.
So when we were first starting to develop Carat and
discovered that this was true, it was a little bit
disheartening.
We thought it might be a show-stopper.
But we came up with this neat statistical trick that we used
to deal with this kind of uncertainty.
What we're going to do is we'll take the measurements
that we believe to be accurate, these battery level
changed events.
And we'll use this to build a prior distribution.
And this prior distribution is what we'll use to fill in the sort of uncertainty gaps in the other information, as I'll
now describe.
So you observe the battery level over time.
And let's say this blue line is the true battery level.
And Carat, every now and then, is able to wake up and take
these samples.
And so the green dots here are going to be samples taken when
the phone was discharging.
And the red ones when the phone was charging.
So we'll throw out the red ones.
And now, let's look at some particular consecutive pair of
these discharging samples.
And as I said, this is how we compute these rate
distributions.
So if green is the value that the battery API returned, we
know that it could be as much--
as low as the extent of this black bar that's
hanging down from it.
And so the actual rate of battery drain between these
two samples could be as low as Y, meaning the low point of
the first value and the high point of the second, or as
high as x, the high point of the first one and the low
point of the second one.
And so these are our bounds of the low and high potential
battery drain.
And we use this to take a slice of the prior.
And then once we have this slice, we build a new probability distribution.
And rather than recording a single value as the rate of discharge between these two samples, we instead have a
distribution of discharge rates.
And this is based on the accurate measurements that we
took from these other events, these battery
level changed events.
So this is one of the ways that we deal with uncertainty: by using the statistics that we have from other
measurements, both from that phone and from other phones, to essentially fill in these gaps.
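A minimal sketch of that trick, assuming a discretized prior over drain rates built from the trusted battery-level-changed events; the bucket representation and names are invented for illustration.

    // Illustrative: instead of one point estimate, keep the slice of the prior that is
    // consistent with the battery API's up-to-5-point uncertainty, then renormalize.
    def rateSlice(prior: Map[Double, Double],           // drain-rate bucket -> probability
                  b1: Double, b2: Double,               // reported levels, may read up to 5 points high
                  dt: Double): Map[Double, Double] = {
      val low   = math.max(0.0, ((b1 - 5.0) - b2) / dt) // slowest drain consistent with the readings
      val high  = (b1 - (b2 - 5.0)) / dt                // fastest drain consistent with the readings
      val slice = prior.filter { case (rate, _) => rate >= low && rate <= high }
      val total = slice.values.sum                      // assumes the slice is non-empty
      slice.map { case (rate, p) => rate -> p / total }
    }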
OK, so another question you might ask is whether these sorts of tricks work.
Does the sampling that we get from Carat match the reality of energy consumption?
And also, how much does Carat cost to run?
So if you have this going on your device and taking these
samples, are you draining your battery just as badly as
whatever the offending app was?
So to test this, we did a couple of things.
So one was that we got this Monsoon Power Monitor, which
is a neat little device that you can use.
And actually, the previous slide shows
it on the left here.
And we rigged it up to an iPhone, so that we could
measure the actual rate of energy drain.
And we also partnered with a company, a battery company
called Leyden Energy down in the South Bay here, which has
a bunch of battery testing equipment.
So they can hook up devices to their machines and run various
usage scripts.
And so for the iPhone and a Galaxy Tab 2, we did these
experiments where we would run 8 to 10-hour usage scripts,
doing things like browsing the web, email, and so on.
These were repeatable scripts.
And tested what the battery drain was, as well as looking
at Carat taking samples.
And then a third experiment where Carat was not running,
but we ran through the same usage script.
And so we'll use these to quantify how accurately Carat
is able to measure the energy consumption, and also how much
energy Carat is taking up.
Yeah?
AUDIENCE: First, can you repeat the question
[INAUDIBLE]?
Have you guys just published these stats by themselves?
This seems generally interesting to people
[INAUDIBLE].
ADAM OLINER: Your question is, have we published the stats?
The stats being which stats?
AUDIENCE: [INAUDIBLE].
ADAM OLINER: Oh, just like, what are the raw numbers of
how much energy gets consumed by various things?
AUDIENCE: [INAUDIBLE].
ADAM OLINER: So we've mostly been focusing on the app
developers at this point.
We've talked a little bit with folks at Apple.
And I'm here, hopefully, to talk to the Android people.
But it seems like they don't care
about energy or something.
I don't know.
If they care, they can talk to me.
I did my part.
So the customer service at Monsoon was fantastic.
So we had some trouble getting the iPhone
hooked up to the device.
As it turns out, Apple products are not intended to
have wires soldered onto them and whatnot.
They're not designed to make that easy.
And so they actually drew us this diagram by hand.
In the picture, I guess that's like a TI-82 running Windows,
or something like that.
But it's supposed to be an iPhone.
Anyway, so it worked.
And that was pretty exciting.
OK, so just to give a little bit of a feel for some of
these numbers, obviously there's a lot of data here and
I'm not going to talk about all of it.
But the first high-level bit is that the values that we see
on the battery indicator that's returned by the API
agree with what you see coming out of the power monitors.
So these are just numbers that, more or
less, track each other.
The second high-level bit is that you get pretty good
accuracy in terms of using the prior to compute what the
energy drain distribution is.
So this orange curve is the discharge distribution as
estimated by Carat.
And the green and black ones are the distribution as
measured by the numbers that we got out
of the power monitor.
And they don't match exactly, but the thing to recognize is
that this is from one device over a single day.
This is data from the iPhone.
And so over the course of that experiment,
Carat took nine samples.
It only measured the battery level nine times.
But using the prior, was able to get this accurate of a
distribution.
The other high-level bit here is that running with and
without Carat is basically the same amount of battery.
And just to give a little bit of quantities to these, during
the experiment we did with the Galaxy Tab, Carat only
underestimated the battery drain by 0.00015% per second.
On iOS, it was a little bit worse.
So these estimates were fairly good, given that Carat had fewer than a dozen samples compared to the thousands and thousands
that the power monitors were measuring.
And the second is that Carat is extremely low overhead.
So during the experiment where Carat was not running, in its
place we ran the standard Weather app.
And Carat uses less energy than the Weather app.
The Weather app used 3.5% more of the battery over the course of this experiment than Carat did.
So that was the ground truth experiments.
I'll talk now a little bit about the implementation,
focusing on the back end.
So as I said before, this is what the Carat infrastructure
looks like.
You have the crowd, which communicates data to a central
server, which is then sent to the back end analysis.
And mostly today, I'll be talking about this back end.
So to implement the analysis, we used a language which was
developed at the AMP Lab at Berkeley called Spark.
Spark is a cluster computing framework that uses a
structure called Resilient Distributed Datasets.
And the merit of these distributed datasets is that
they reside in memory, and they persist
across runs of jobs.
And so the idea is that rather than, like a MapReduce job
where you have to reload everything into memory, if you
want to run it again, they instead
stay resident in memory.
And the intention here is that you will run iterative jobs or
interactive workloads.
And so, this is one of the merits of the RDDs.
They also provide various other features, such as the
ability if a node crashes, to efficiently recompute the data
that was lost.
And so it's actually quite a cool language.
And we've had a lot of success with it.
And I'll show you some of the parallelization numbers in
subsequent slides.
So one of the main challenges that we faced when trying to
parallelize this was the process of converting the
samples to rates, and then the rates to distributions.
And to get the samples to rates, one of the main
challenges there is that we had this inter-sample
dependency.
Where, in order to compute a rate, we needed these
consecutive pairs.
But then the subsequent rate needs the second sample from
the first one and so on.
And so one of the things we had to do was convert
everything into these consecutive sample pairs and
use those as kind of the unit of information.
So there was some replication of data going on there, but it
removed this dependency that we had between samples, and
let us parallelize much more efficiently.
And then this is a slide that's showing a little bit
about how we do the conversion between rates and
distributions, just to give you a sense of what a workflow
might look like in Spark.
And it should look very familiar to you.
It's operations like map, and reduce, and group by.
And so this uses the same sort of language that you see in
MapReduce operations, except Spark supports general graphs
instead of just specific kinds of graphs.
And so we were able to leverage the Spark parallelism
in order to do this and use RDDs to good effect.
And I won't talk too much about the details here.
I think it's fairly self-explanatory.
And if there are questions about it, then I can field
those individually.
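As a very rough Spark-style sketch of that workflow, reusing the hypothetical Sample and Rate records from earlier (the operators are real Spark, but the pipeline is illustrative, not the actual Carat job): pairing consecutive samples per device removes the inter-sample dependency, and the rest is ordinary map and groupBy work.

    import org.apache.spark.rdd.RDD

    // Illustrative: samples -> consecutive pairs -> rates -> per-feature rate distributions.
    def analyze(samples: RDD[(String, Sample)]): RDD[(String, Seq[Double])] =   // keyed by device id
      samples
        .groupByKey()                                             // gather each device's samples
        .flatMap { case (_, ss) =>
          ss.toSeq.sortBy(_.time).sliding(2).collect {            // consecutive pairs as the unit of work
            case Seq(a, b) if b.battery < a.battery =>
              Rate((a.battery - b.battery) / (b.time - a.time),
                   a.features intersect b.features)
          }
        }
        .flatMap(r => r.features.map(f => (f, r.percentPerSec)))  // one record per (feature, rate)
        .groupByKey()                                             // feature -> all observed rates
        .mapValues(_.toSeq)                                       // the conditional drain distribution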
So Carat, as you might imagine, is very low traffic.
So any individual device is not communicating very much
information to the servers.
On average, it's way, way less than a byte
per client per second.
And this means that with relatively few servers, we can
handle all of the both incoming and outgoing traffic
that all of our users generate.
This is currently being handled by 5 AWS instances
running with a load balancer in front.
But really, this is overkill.
So the server itself is not a bottleneck for us.
I'm just sort of showing you this to say, don't worry too
much about the central server.
The main issue is dealing with the analysis.
So parallelizing the analysis was absolutely essential.
We had done a sort of naive parallelization of it when we started out with our initial sort of lab distribution of a
few dozen users.
But as I'll describe later, this fairly quickly exploded and we had to do a proper parallelization of it in order
to make it scale.
So just to compare here, this is an optimized serial
implementation.
And on the x-axis, you have the number of samples.
So the red is the optimized serial implementation.
And then the green one is when you actually
parallelized this in Spark.
So essentially, by the time we have the number of users that
we currently have, which is about 340,000, the analysis
would have taken more than a day to run if we had stuck
with a sort of serial implementation.
In Spark, we can do the whole thing in about 45
seconds from scratch.
So I'm going to spend the rest of the talk describing the
deployment that we have, and
explaining some of the results.
Just giving some examples of some of the anomalies that
we've seen and so on.
Yeah, Phil.
PHIL: So as a user, what is the significance of the
[INAUDIBLE]?
Do I have to go to the server and wait [INAUDIBLE]?
ADAM OLINER: Yeah.
So we're changing how this works.
So I'll explain the process of how from a user's perspective
they interact with the Carat app.
So currently the way it works is you start up Carat and it
basically just doesn't have any information for you.
It will start taking samples and sending stuff to the
server, but we'll just say that it has no results yet.
And this was true initially just because we had so little
data that we couldn't bootstrap it on anything.
But now that we actually have users, the intention is that the app can immediately report what apps a new user is
running, and we can give back to them information about the hogs.
So at the very least, we can populate the hogs list when
they first open the app.
And this is something that we intend to do soon.
The bugs, on the other hand, require us to observe that
particular device over time.
So the speed of this is not bounded by the analysis speed
in terms of, how quickly can we tell them about bugs?
It's really just a function of, how quickly can we get
enough data from them to have enough confidence that we can
say something about which of these apps are misbehaving?
And so that's really the limiting factor.
The reason that having an analysis that runs quickly is
important is that, first of all, it wasn't like 45 seconds
versus a minute and a half, or something, that was the issue.
It was that it was taking more than a day for people to start
getting any sort of results.
And that was frustrating them, and us.
Another thing that's nice about having it be only 45
seconds is that these diagnosis trees that we build,
we can go much deeper and start to look at things like
combinations of features as opposed to just individual
features, and splitting based on those.
So all of these basically give us more flexibility in terms
of what sort of situations we can look at.
All right, so initially, back in I guess late January, early February, we had a little iOS implementation and we
distributed it to a small number of people.
You're allowed to basically get 100 people to sign up for this.
This is a cap that Apple imposed.
Of those people, 75 installed.
The 25 who signed up and then did not install,
unfortunately, we couldn't reuse their spots.
You can't recycle unused spots for these sort of beta
distributions, which is frustrating.
But so this initial deployment was 75 people.
And from these 75 people, over a couple of weeks we collected
about 10,000 samples.
So the first thing to know here is this is a really small
initial deployment.
There's very little data.
And so despite that fact, we were able to find apps that exhibited energy anomalies in general, and 35 in particular that exhibited energy bugs.
And these were in popular apps.
So things like Facebook, and Kindle, and Flipboard, which
is a digital magazine app.
And we were able to corroborate these things with
forum posts, and news articles.
And even, in some cases, the implication of some of these forum posts was that there were particular features that
were triggering these misbehaviors.
So interactions with a particular OS version, or
issues when they didn't have Wi-Fi access, and so on.
And we were able to actually corroborate those with the
data that we had.
So despite this relatively small deployment, we were
already starting to find some interesting energy anomalies.
So this was a very encouraging initial result.
We also tried doing some synthetic bug injection.
So we took the Wikipedia mobile app and we wrote in
some behaviors, some misbehavior that we could
manually trigger that would cause the app to
use a lot of energy.
Basically, abuse a resource like the GPS or
the CPU, and so on.
We installed this on one of the devices in the deployment and then tested whether, when we triggered that misbehavior,
Carat came back and said, it looks like Wikipedia running on that particular device is
exhibiting an energy bug.
And in all three cases, it did so.
And in no other cases on any of the devices running
Wikipedia mobile did it say that it was anomalous.
So essentially, it did the right thing in this injection
experiment.
So both of these initial results were pretty promising.
And so we developed an Android version and submitted both of
these things to the respective app stores.
And then, after a bit of a delay, finally in mid-June,
they both got in to their respective app stores.
And then, we got featured on TechCrunch, which kind of took
us by surprise.
And it was very exciting, so we were pretty thrilled that
we might go from like 75 users to some larger number.
Then Lifehacker picked up the story from TechCrunch, as did
dozens of other news sources.
So within 24 hours, there were at least dozens of articles.
And we rocketed to more than 100,000 users
in less than a day.
So I basically woke up, and then checked Flurry and said, oh, no, that's a lot.
And then, within the 24 hours after that, we
went to 200,000 users.
And so the rest is history.
So today, we have about 340,000 devices.
Most of those are iOS, but Android is gaining quickly.
I think mostly the reason that iOS started out ahead is that
the TechCrunch readership, I think, is a little bit biased
toward Apple devices.
I can't back that up, that's just my
impression of the crowd.
And so far, at last count, we have something
like 20 million samples.
So this is a far cry from the 10,000 that we used to have.
And obviously, there's a lot more that we're
finding in this data.
So to give you a sense of what these numbers look like, what
sort of energy anomalies we're seeing, about 9.4% of the
reported apps qualify as energy hogs.
This was about 11,000 apps.
These include the ones that you might expect, like Pandora
and Skype, things that use a lot of resources.
But what's surprising is that some of the energy hogs are
not apps that you would expect to be energy hogs.
So, for example, there was a family of basically airline search-related apps, where you could check whether there were flights available
and so on, that were using just far more energy than they ought to be using compared with other similar apps.
And also, we found lots of instances of energy bugs.
So about 5.3% of the instances of apps running on devices qualified as buggy instances.
These include things like Kindle, and
Facebook, and YouTube.
And I'll describe a couple of examples of energy bugs in
subsequent slides.
And obviously, I can't even scratch the surface on some of
the interesting stuff here.
So I'm just going to give a couple of examples.
If you're interested in seeing examples of energy hogs and
energy bugs, I encourage you to run Carat on your device.
You'll probably find some.
So one example of an energy bug that we found was Kindle
running on iOS.
I particularly like this example.
I think it was really interesting.
So this was reported as a bug on 3.9% of the clients.
So a relatively small fraction of the people running the
Kindle app were seeing way more energy drain than the
other users.
And this has already been sort of discovered by frustrated
users who were running the Kindle app.
And they were complaining on the forum.
And one of the theories that was pushed forward was that it
was related to this WhisperSync protocol.
So if you've used one of these Kindle apps before, as you
know, it synchronizes your bookmarks and the books that
you've purchased-- obviously, annotations and so on--
between your various Kindle devices.
So it turned out that there was a bug in this
implementation, a bug in the protocol.
Such that when you did not have access to Wi-Fi, the
Kindle app would use significantly more energy than
if you did have access to Wi-Fi.
And I thought this was interesting because it's kind
of counterintuitive.
I would think that you would use more energy when you're
using an energy-hungry resource, like the Wi-Fi, than
when you're not.
But it turned out to be the opposite.
And it was a fairly significant difference.
So if you just turned on Wi-Fi when you were using the Kindle
app, you'd get another 36 minutes of battery life.
And this is a statement that we can say with 95% confidence.
That is, for a typical buggy user of the Kindle app, that's what they will see.
And the diagnosis tree that we build looks
something like this.
I'm just showing a part of this tree to give you a sense
of the sort of information that we can compute.
So this is saying the battery life without Kindle versus
running with Kindle.
One thing that's interesting is that actually you get
better battery life running Kindle than when running
standard other apps.
So when I say without Kindle, that means running an average
of all the other apps that someone might run.
So in other words, on the whole, Kindle is a very
energy-efficient app.
[INAUDIBLE] displaying something on the screen.
It's not like you're playing a game that's using
a lot of the CPU.
It's not constantly using the network and so on.
So it's actually not an energy hog, which is
important to recognize.
So it's actually a fairly
energy-efficient app in general.
But it turns out that during this WhisperSync process, it
uses a lot of energy.
So you can see this by looking down the next level on the
tree, which describes network connectivity.
So this "with Kindle" figure of 8.4 hours is the sort of agglomeration of the three distributions below it: when
the network's off versus when you have these two different types of connectivity.
So what this is saying is that when the network's off and
it's not trying to do this synchronization, you get
pretty good battery life.
It's using basically no energy.
But when it's actually doing the sync, it's
energy-inefficient.
And in particular, it's less energy-efficient using 3G than
when running Wi-Fi.
And this is the sort of surprising aspect of this
diagnosis, to me, at least.
So before I move on, is there a question about this?
AUDIENCE: The bug on the previous slide that you were
talking about, wouldn't that-- sorry.
How is that reflective in this [INAUDIBLE]?
ADAM OLINER: So the difference between running with the 3G
versus running with the Wi-Fi.
So essentially, if you're in the 3G node and you turn on
Wi-Fi, then you move over to this Wi-Fi node.
And so that's the difference in energy that you would
expect to see during the synchronization process.
AUDIENCE: That makes sense.
You expect to get more battery life Wi-Fi than on 3G.
ADAM OLINER: Do you?
AUDIENCE: 3G radio take more power.
ADAM OLINER: But your 3G radio, it's on anyway.
AUDIENCE: I might not have a good [INAUDIBLE].
ADAM OLINER: Yeah.
AUDIENCE: It's not surprising me.
ADAM OLINER: Well, some of these resources are
complicated.
So like the radio, for instance, has tail energy that
you need to consider.
So if something else uses the radio, then you pay for that
tail energy for some period of time following the actual use
of the resource and so on.
So all of these computations that we're doing are just
talking about averages.
So it's very hard to say for a specific device what's likely
to be the energy consumption.
Because there are lots of kind of chaotic things going on.
But as a sort of general statistical statement, users
who have Wi-Fi access are getting better battery life.
Maybe I'm naive to be surprised by this.
AUDIENCE: That's exactly what they said.
ADAM OLINER: OK.
Well then, there you go.
So turn on Wi-Fi, kids, I guess is the moral.
OK, so I'll move on.
So another example is the Twitter app on Android.
So most of you are probably familiar with Twitter.
This was reported as a bug on about 15% of the clients
running it.
So this is a relatively large fraction, I think.
If you look at the diagnosis tree, which we sometimes call
an MCAD for reasons that I won't go into, it implicates
the operating system version.
So in particular, what it suggested was that users
running Ice Cream Sandwich 4.0.4 were seeing 94 minutes
more battery life than Twitter users who were running any
other version of the operating system.
So there was something going on with Twitter interactions
with these other versions of the operating system that was
causing a significant drain on the battery life.
So this was a serious problem.
It also turns out that Wi-Fi helps.
Maybe.
This is now becoming like a thing.
Get on the Wi-Fi network ASAP.
So this is interesting.
So this is another actionable thing that you can do.
So in addition to the recommendations of killing an
app, or restarting an app, Carat sometimes will tell you,
upgrade your operating system.
And currently, that recommendation, at least on
the UI side, is only coming about if you see that users of
a particular version of the OS, as a whole, are getting
much better battery life.
But on the back end, there are all sorts of things that we're
computing that would allow us to say, you're a Twitter user.
And so if you're going to use Twitter a lot, you should
definitely upgrade because it seems to make a big
difference.
So we're computing all of this on the back end, we just
haven't been focusing on the UI recently.
Such is research, I suppose.
OK, is there a question about this story before I move on?
OK.
Cool.
So in aggregate, what sort of effect does Carat seem to have
on battery life?
So I'll describe this plot, and then there are a bunch of caveats in trying to talk about this sort of data that
I'll describe as well.
So the x-axis here is the days since the first report that
Carat gives the user.
So user uses it for a while, and then at some point, Carat
thinks that it has enough information about that user to
report back and say, here are some energy hogs and some
energy bugs on your device.
So 0 here is the day that Carat has made that decision
and sent out that information.
It does not necessarily reflect anything about when
the user first read those pieces of advice, when the
user first did anything about them, and so on.
All of that is very hard for us to measure from
our side of the fence.
But in aggregate what it says is that after 10 days, the
average Carat user sees about 10% more battery life.
And then, after 90 days, it suddenly [INAUDIBLE]
up to about 30%.
So this is a significant improvement in battery life.
There are obviously a number of caveats with this.
So the first is that we have a very biased sample of the
population.
This is users who sought out and installed, and then
continued to run Carat for 90 days or more.
So frequently, these users had pretty bad battery life to
begin with.
So the baseline may have been quite low.
The second thing is that users typically will not do a nice
controlled experiment for us where they will change no
other behaviors except running Carat, which is unfortunate,
but that's what we got.
So there may have been other things that these users were doing during this period of time that caused their
battery life to improve.
So one of the things they could even have done is actually replace the battery with a new one at some point
during this process.
So this is aggregated over hundreds of thousands of
users, but there are various confounding factors that we
just don't have the data to disambiguate.
All we can say is that in a general sense, it seems like users of Carat are improving their battery life by double
digits, even after relatively short amounts of time.
Yeah?
AUDIENCE: How wide is the spread variation?
ADAM OLINER: I don't have those numbers for you,
unfortunately.
We did compute them, but I just don't remember them.
AUDIENCE: [INAUDIBLE]?
ADAM OLINER: The 95 percentile band.
So it starts out and it's pretty thin.
It gets a little bit bigger as you go out in the graph
because there are fewer users.
That's sort of why you see this jaggedness.
There are fewer users who have been running it for that long,
just because we haven't been around for that long.
I mean, 90 days is almost the entire
lifetime of the project.
OK, so one thing I talked about before is this
convergence of the error bars, the increase in our confidence
as we get more data.
And so this is an example of this happening.
This is an aggregate of the error bars and the expected
value estimates for several hundred of these energy hogs.
So on the x-axis, you have the number of samples.
And so again, we have like 20 million samples, or something.
So these are really small numbers that we're looking at
here in terms of how many samples.
And on the y-axis, you're looking at the relative
expected value, which is--
so 0 would be a perfect estimate of the true expected
value based on all of the data as opposed to just this
sampling of it.
And then the error upper bound and lower bound are these
dashed dotted lines.
Essentially what you're supposed to see here is that
those two converge very quickly, so your error
decreases rapidly, as does the accuracy of your estimate of
the expected value.
This is very helpful to us.
I guess this is the point.
So this is a property that we relied on and are glad to see
in practice.
OK, and then the last thing that I'll mention in terms of
the data here is the prediction accuracy.
And this graph requires a little bit of explanation
because it's actually very hard to
visualize this statement.
Essentially, what's happening is that Carat is saying, if you kill this app, you'll get, let's say, an hour more
battery life, plus or minus 10 minutes with 95% confidence.
Let's say that a user sees that kind of a statement.
So what we'd like to do is assess the accuracy of that
statement over a large number of users, over a large number
of these recommendations.
So if you had perfect prediction, what that would essentially mean is that--
an hour plus or minus 10 means that if a user went from using the app all
the time to never using it, they should see an hour, plus
or minus 10 minutes, more battery life.
We treat this as a linear relationship for the purposes
of assessing its accuracy.
And so if they go from using it all the time to just using
it, say, 50% of the time, then they should see
half of that benefit.
It should be 30 minutes plus or minus instead of an hour plus or minus.
And so, this is essentially describing--
this hour number is essentially describing the
slope of a line.
And so we can characterize the slope that users see in
practice when running these apps.
So we can say, the user running this particular app,
when he goes from using it 60% of the time to 20% of the
time, how does his battery life change?
And so you can do this over a large number of users for a
large number of user recommendations.
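A small sketch of that linear scaling, with invented names: the predicted benefit is treated as a slope over the change in usage fraction, and a recommendation counts as within bounds when the observed change falls inside the scaled error.

    // Illustrative: scale the predicted benefit by how much the user actually changed
    // their usage, then check the observation against the scaled error bounds.
    // predictedGainMin: e.g. 60 +/- 10 minutes for going from always running to never running.
    def withinBounds(predictedGainMin: Double, errMin: Double,
                     usageBefore: Double, usageAfter: Double,     // fractions of time in [0, 1]
                     observedGainMin: Double): Boolean = {
      val delta    = usageBefore - usageAfter                     // e.g. 1.0 -> 0.5 gives 0.5
      val expected = predictedGainMin * delta                     // linear scaling of the prediction
      math.abs(observedGainMin - expected) <= errMin * delta
    }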
So the green line here would be perfect prediction.
That would be the slope if we get it exactly right.
And each of the other lines is representing a particular
recommendation for a particular user.
So one of these lines might be Dan's phone running Facebook.
And we told him 45 minutes, so it should be right on that green line if it were perfect prediction.
And the closer these lines are to the green line, the closer
we were to being correct.
So the gray lines here are ones that were within the 95%
confidence bounds for that particular recommendation.
So if we told him an hour plus or minus 10 and it was an hour
and 7 minutes, then that would be within the confidence
interval, within the error bounds, and so that would show
up as a gray line here.
And the orange lines are ones that fell outside of the
confidence interval.
So we measured this for a large number of these
recommendations and for the 95% confidence bounds, 95.4%
were within those bounds.
So this is exactly what one would hope.
Just under the wire there, right?
So 95% would have been the lowest that we could go.
So this was a very good result for us.
We were pretty excited about it.
Basically what it means is that we're able to not only
say with pretty good accuracy what sort of battery life
improvements users should expect to see, but even the
error bounds that we were computing seemed to encompass
the right fraction of the users.
Yeah?
AUDIENCE: [INAUDIBLE].
Is that just a one-to-one thing, that slope of the line?
ADAM OLINER: It's a relative slope.
I wanted to get a lot of information into one graph.
So the idea here is--
AUDIENCE: [INAUDIBLE].
It's not like the other slopes are different slopes.
They're all--
ADAM OLINER: They are, actually.
AUDIENCE: They are different slopes?
ADAM OLINER: Yeah.
They are slightly different slopes.
It's a difficult graph to describe.
I'm trying to figure out better ways of visualizing
this sort of information.
But the high bit here is that bullet point, which is that
for a 95% confidence bound, more than that
are within that error.
So it was good.
OK, so I'm just going to wrap up here by talking a little
bit about what we're working on now.
So the first is we're still working to improve the error
and confidence bounds that we're giving to people.
So we'd like to be able to give more complicated
diagnoses and still be able to attach these sort of
confidence intervals.
So telling a particular user that when you're running
Twitter, you should turn on Wi-Fi or something like that.
Or upgrade to Ice Cream Sandwich 4.0.4.
Another thing that we're looking to do is build an API
for developers.
So if we, for example, see that Twitter is exhibiting an
energy anomaly on a particular subset of the devices, we
don't know, for example, whether that subset of devices
just has a particular setting inside of the app, or if the
user is just interacting with that app in a particular way.
That's not something that's visible to Carat.
So one of the ways that we can get around this is if the
developer of that app adds a little bit of instrumentation,
something that you see in, like Flurry, or similar sorts
of libraries.
And so this would let us know, what are the settings of that
app and so on, and give us additional features that we
could mine for relationships between energy consumption and
those features.
So this is part of a sort of vision that we have for doing
this sort of collaborative debugging, just deploy to the
crowd and debug in the cloud.
There's nothing specific to the analysis about energy.
So that semantics is entirely invisible to everything that the back end does.
So there's no reason that we can't also use this for
performance debugging, for example.
Or really, the consumption of any of these resources.
And more broadly, we're starting to look at this as
statistics as a service.
So we've been talking about applications of
this to other domains.
For example, one of my favorites is the body metrics.
So a lot of people have like sleep monitors, and running
monitors, and so on.
And you could ask similar questions about
those sorts of data.
So compared with other people who have running habits like I
do, how do my sleeping habits compare?
Why am I not getting enough sleep?
For the people who are not getting enough sleep, what are the dimensions in which they differ from the people
who do, and so on.
So anyway, it seems like there are a lot of interesting
questions that you can ask of the kinds of--
of the form of the questions that Carat asks.
And so we're starting to build up sort of statistical
service, so that you can start to answer those questions.
If you want to download Carat, either for iOS or Android,
there's a link at that URL.
The client code for both platforms is on GitHub.
So feel free to fix any bugs that you find and help us
improve the app.
We certainly are shorthanded, so we appreciate the help.
So go download it.
We're keen for feedback.
And we'd love to talk with some folks here about some
more ideas.
Anyway, that's all.
Questions?
Yeah.
AUDIENCE: Are the decision trees that you showed
automatically generated?
What do you use to [INAUDIBLE]?
ADAM OLINER: They are.
So the question was, how do we build those decision trees?
What's the algorithm for exploring the feature space?
And that's a rapidly changing question.
So currently, it's fairly brute force.
The feature space that we're considering is a little bit
constrained, so we're only looking, for example, at Wi-Fi
access, the device model, operating system version, and
a couple of other things.
So doing a brute force
exploration is still tractable.
In the longer term, we're obviously looking at ways of
prioritizing how to explore the space, et cetera.
And so I'd be happy to talk with you about that, actually,
if you have ideas.
Yeah?
AUDIENCE: I have a contentious idea for
getting more blog attention.
[INAUDIBLE].
ADAM OLINER: So the question was, could we get more
attention by shaming people into being aware of various
energy problems and so on?
Yes, we could.
So the issue here is that it's basically three people working
on it, one of whom is international.
And none of whom, as a primary job, is trying to build a
product or garner attention and so on.
And so there are lots of cool things that we have in our
heads that we would love to be doing, we just don't currently
have the people to do it.
So that's a problem that we're trying to solve, but--
yeah?
AUDIENCE: And you can only get enough data at this point,
too, to [INAUDIBLE].
ADAM OLINER: Yeah, we have more data than I ever expected
us to have.
So yeah, it's really exciting, actually.
All right, cool.
Thanks, guys.