MATT WARD: So I guess we can get started.
My name's Matt Ward, and this is my
friend, Steven Robertson.
And today we're going to tell you a little bit about how
we're changing the way we stream videos at YouTube.
We're going to talk about adaptive streaming.
And as you know, we'll be talking about
the internet today.
This is a great picture that my friends over at
collegehumor.com put together.
You can go check it out if you don't know any of these memes.
A lot of them are YouTube memes, which is why I'm
showing this off.
And we're actually going to go and talk about the birthplace
from our perspective.
This is not the artist-- once the artist uploads it, this is
kind of where everything starts for us.
And that's in our Google data centers.
We're making some big changes about how we're processing
these videos that we're going to go into.
And then the next part of these videos' life cycle, if
you will, is traveling through these pipes.
And these are kind of the first pipes that these videos
go through as they leave our data centers and head towards
your computer.
It is one of the first possible places that we can
see congestion.
And there's a lot of congestion out there on the
internet today.
Now, I have to click that on this slide.
And this is really kind of what we're doing to the
internet today with all of the data that we're putting
through it.
We're really just doing this.
[AUDIENCE LAUGHS]
MATT WARD: Don't worry, it'll start.
[SPEAKING JAPANESE IN VIDEO]
MATT WARD: We're pushing as much data as possible through
these pipes right now.
And it's kind of getting crowded in there.
So when it gets crowded like that, this is
what our users see.
They get a buffer spinner.
Your network gets congested, and you're not able to pull
data down as quickly as you were before,
and so, buffer spinner.
So today we're going to talk about a four-year project at
YouTube to change the way that we actually
stream videos to users.
We're going to talk about two projects.
The first project is called Zeri.
It's a new API, specifically, that Adobe introduced to us back in 2009.
It took us a couple of years to get a working prototype,
and another year to actually ship it to all of our users.
And then we started really focusing on HTML5, and we're
going to end today talking about how we're doing the same
kind of thing in HTML5.
So let's kind of get started.
We actually refer to this project as Sliced Bread
internally.
We talk about it as chunked video playback.
The best thing since sliced bread.
And that's because before these APIs, we were delivering
video in giant loaves of bread like this.
There was nothing between the internet
and the actual plug-in.
Here, I'm going to give you a couple
flash examples to start.
And we really just used to send the video right into the
net stream, and the video would play.
That was it.
There wasn't any more complication to it.
And then we introduced the Zeri API, and now we sit right
in between the internet and the plug-in, and we really
have control over how quickly the data is
flowing through it.
And we chunk everything.
And we'll go into why that happens in a minute.
[INAUDIBLE]
MATT WARD: Exactly.
There's four reasons why.
We will go into each of them in more details, but let's
start talking about efficiency.
And like I said, chunking.
So the first piece of the pipeline that we built was
taking the video and requesting it in ranges.
So rather than just requesting the whole entire video in one
big request, kind of like putting data on as it comes
along, what we did is we said, OK, give me two megabytes.
And then I'll put that on.
Give me another two megabytes, and then I'll put that on.
And what this did was it actually reduced our
over send by 5%.
This is a really big savings for us and
great for the internet.
We actually were making more space on the internet for
other people to send bytes through, because we made this
improvement here.
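To make that concrete, here is a minimal TypeScript sketch of requesting a video in fixed-size byte ranges. The two-megabyte chunk size is the figure from the talk; the function names and stop condition are hypothetical, not YouTube's actual player code.

```typescript
// Sketch only: pull a video down in 2 MB byte-range requests instead of one
// giant request, so we only fetch data as the player actually needs it.
const CHUNK_SIZE = 2 * 1024 * 1024; // the two megabytes mentioned above

async function fetchChunk(url: string, offset: number): Promise<ArrayBuffer> {
  const resp = await fetch(url, {
    headers: { Range: `bytes=${offset}-${offset + CHUNK_SIZE - 1}` },
  });
  if (resp.status === 416) return new ArrayBuffer(0); // past the end of the file
  if (resp.status !== 206) {
    // 206 Partial Content is what a range-capable server returns.
    throw new Error(`unexpected status ${resp.status}`);
  }
  return resp.arrayBuffer();
}

async function streamInChunks(url: string, wantMore: () => boolean) {
  let offset = 0;
  while (wantMore()) {                 // e.g. "is the read-ahead buffer low?"
    const chunk = await fetchChunk(url, offset);
    if (chunk.byteLength === 0) break; // nothing left to fetch
    offset += chunk.byteLength;
    // ...hand the chunk to the player's in-memory buffer here...
  }
}
```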
So, let's go into another advantage that this gave us,
which is because we were doing this chunked request thing,
what's happening is when we go and seek around the video, we
will try and align your chunk request again so that the
browser can actually handle a layer of caching for us.
So if we seek back to, say, zero seconds, and you're all
at the end of the video, and the browser had that first two
megabytes of the video in cache, we can get access to
that really quickly.
And that's kind of the first layer of cache that we see.
And the next layer is actually a little
bit more in our control.
So the browser cache, we don't have any control over the
eviction that's going on there.
We can't actually say, do you have this two
megabyte chunk in memory?
The browser hasn't given us an API for that, and we don't see
one coming anytime soon.
So we introduce this next layer of cache
that's really in memory.
And that's really what you're seeing when you see a buffer
bar on our video.
That's the amount of data that we're actually holding in
memory inside of the Flash plug-in.
So once the data flows out of our in memory holdings, we
kind of send it up to this net stream object in Flash.
And that's where the decoder kind of lives.
And once we've passed the bytes in there, we really
don't have any control over it anymore, and the only thing we
can do is say stop.
So, let's talk a little bit more about this buffer bar.
So this is our great seek bar here.
And what we used to do, to try and prevent all that over-send we talked about a little earlier, is we actually had logic in our servers that trickled the bytes down.
Believe it or not, this logic wasn't perfect.
I know it's shocking.
But now that we are looking at how many bytes the client is
actually receiving, and we have this really accurate
measurement of how many bytes are arriving on the client,
we're able to effectively manage how much
read ahead we have.
And today we kind of cap that around 40 seconds.
And that's what we found was a good value.
When you make that value larger, you crash a lot of
Flash plug-ins.
Trust me.
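A hedged sketch of what that read-ahead cap looks like on the client; the 40-second figure is from the talk, the rest is illustrative. A check like this would play the role of the `wantMore` callback in the chunk-fetching sketch earlier.

```typescript
// Sketch only: decide whether to request more data based on how far ahead of
// the playhead our in-memory buffer already extends.
const MAX_READ_AHEAD_SECONDS = 40; // the cap mentioned above

function shouldRequestMore(currentTimeSec: number, bufferedEndSec: number): boolean {
  // bufferedEndSec = media time up to which we are holding data in memory.
  return bufferedEndSec - currentTimeSec < MAX_READ_AHEAD_SECONDS;
}
```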
So the next thing here that I've pointed out that we
changed was we had this other really advanced logic in our
servers to handle a certain kind of seeking.
So when we talk about seeking there's two kinds of seeking.
The first kind of seeking is what we
call in buffer seeking.
And that kind of looks like this.
You seek to an area that you kind of have already
downloaded and you have in memory in the browser.
And these we're able to handle by feeding the bytes we kind
of already have in memory right in there.
Now, what happens when you seek to this region, which we
don't really know anything about?
Right?
In the past, those in buffer seeks were really simple,
because the plug-in just handled them for you.
You didn't have to think about it at all.
But when you did this, the plug-in would say, I don't
know anything about that.
What can you do for me?
And we would re-request the whole entire stream again by
appending a start parameter to the actual URL, and say, hey,
I seek to a minute into the video, but I'd only downloaded
30 seconds, right, so I don't know what's going on at that minute.
So really you would say, start at 60 seconds and give me a
video stream from there.
And the servers would stitch together a new video for us.
And we changed that, and now we do all of that logic on the
client side.
And it's made our servers a lot dumber, if you will.
But our clients are really, really smart.
And it also let us play some really neat tricks on the
client that we'll talk about later.
So to do client side seeking like this, we have to gain a
really, really clear understanding of what the
container is.
So I'm going to give a little background on what videos
actually look like.
And I'm going to make a little analogy for you.
Videos are like tacos.
They look exactly like this.
So what do I mean?
OK.
So, the meat of this video, the stuff that we really,
really care about is the underlying
video and audio codecs.
Here I give a few examples of that on the left over here.
And then, the container is this kind of stuff that wraps
around the outside.
It kind of tells you something about what's inside.
It's what keeps the whole video together, if you will.
It brings all of these individual samples that we've
encoded together into this one big glorious hunk of video.
So what we did initially in our Flash project was answer this question of how do we know a byte offset for a time that the user requests?
And we kind of built this really common interface for
the whole entire thing.
You know, FLVs have this time-to-byte map.
MP4s have these things called moov atoms.
Sidx, kind of the same story.
HLS and DASH each have a manifest.
The manifest says, oh, well here's five second chunk, five
second chunk, five second chunk, and that's kind of the
granularity you get there.
So we built a kind of common interface to that from time to
byte that our application can query into to find out this
answer that's so important for seeking.
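Here is an illustrative TypeScript sketch of what such a common time-to-byte interface could look like. The interface and class names are hypothetical; each container format would supply its own implementation, whether that's an FLV time-to-byte map, an MP4 moov/sidx parser, or a manifest with segment-level granularity.

```typescript
// Sketch only: one interface the player can query, regardless of container.
interface ByteRange {
  start: number;
  end: number;
}

interface SeekIndex {
  // Given a media time in seconds, return the byte range that must be
  // requested so playback can resume at (or just before) that time.
  rangeForTime(timeSec: number): ByteRange;
}

// A manifest-style implementation, where the index only knows whole segments.
class ManifestIndex implements SeekIndex {
  constructor(private segments: { startSec: number; range: ByteRange }[]) {}

  rangeForTime(timeSec: number): ByteRange {
    // Pick the last segment whose start time is at or before the target.
    let best = this.segments[0].range;
    for (const seg of this.segments) {
      if (seg.startSec <= timeSec) best = seg.range;
    }
    return best;
  }
}
```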
The next part was for the samples that we actually hold
onto, we'll talk about this in a second, but we actually
sometimes switch the codec that we're using in the middle
of playback.
We're doing this a little less and less today, and I'll talk
about that later.
But let's first talk about what a video
really looks like.
And it looks a lot more like this than a taco.
So the videos kind of start with this thing that
initializes the decoder.
You kind of tell the decoder, this is the key frame I'm
about to feed you.
This is what--
the data you're about to see is going to look like this.
That's what we start by saying.
And then videos have key frames and progressive frames
and this simple example here uses H264 without any fancy B
frames or anything that you guys might know about.
And some basic AAC audio.
That's a lot of what we've been streaming these days.
So let's go into how we do adaptation.
One of the big things that we're able to do now is switch
between streams.
So here we have two streams.
A and B. And we detect that there's a
change in the network.
Right?
And what we want to do is this, make a switch.
So when we make a switch, that has to happen at a key frame,
because the decoder will not understand if you switch at a
progressive frame here.
How do I make this transfer from my current data to some--
each of the progressive frames kind of build on key frames.
So you must feed a key frame first.
And that's the logic that we also had to be very careful to
take care of.
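As a rough sketch of that rule, with hypothetical names: when switching to the new stream, the switch point has to be one of its keyframes, because the decoder can't start from a progressive frame.

```typescript
// Sketch only: a stream switch must land on a keyframe of the new stream.
interface FrameInfo {
  timeSec: number;
  isKeyframe: boolean;
}

function findSwitchPoint(newStream: FrameInfo[], desiredTimeSec: number): number | null {
  for (const frame of newStream) {
    if (frame.isKeyframe && frame.timeSec >= desiredTimeSec) {
      return frame.timeSec; // first keyframe at or after the desired switch time
    }
  }
  return null; // no keyframe available yet; keep playing the current stream
}
```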
So once we kind of had this understanding of all this
stuff, and really understand how to glue videos together on
the fly in our clients, we introduced this simple
splicing mechanism.
Now, this is going to look crazy, but this is what our
application got to at this point.
We were building and tearing down all of these things that
parse and handle different media containers, and cache
each of those individual files that you're pulling
separately.
And then we had this splicing thing that
switched back and forth.
And this API was really interesting and really fun to
work with, but Steve's going to talk about the implications of this later in HTML5, so you guys don't have to deal with any of this stuff anymore.
So the next part, once we have all this stuff, how
do we launch it?
Well, like I said, this took us four years from start to
finish, more or less.
And even once we finished, all of the stuff that you saw so
far, we had all of it like code complete.
We spent months finding bugs, and launching experiments, and
trying to figure out where are all the problems.
So let's talk about a couple of those problems here.
The first problem, remember how I had those nice pretty
lineups between all of the videos?
That's not really what our videos look like.
They look a lot more like this.
So if you're going to switch between stream A and stream B
here, what's going to happen is you're going to have to
switch at that key frame.
Are you going to switch in the middle of that progressive
frame on the top one?
Or are you going to drop that progressive frame?
So what would happen is you would either get black
flashes, or you'd go back in time.
Some people don't notice this, because it's very, very quick.
But this was happening.
Well, I had a worse problem.
People are more sensitive about audio.
Audio--
if you're doing this in audio, you're making pops and glitches in people's ears.
This is bad.
So for a while, actually, we were paying the price of the occasional pop and glitch in people's ears, in exchange for fewer re-buffers and smoother playbacks.
We tried to limit how many switches we would do during a
play back to minimize the number of these kind of
problems that users would see.
But what we're doing today is we're making our
videos look like this.
We're re-transcoding everything inside of YouTube
to really improve the experience.
We want all the key frames lined up so that when we
switch streams, you don't even notice any
black frames or anything.
The audio, for all of our qualities now, is the same.
When we're playing an auto-adapted playback, we hold the same exact audio stream at CD quality through the whole entire thing.
It's great.
I mean, you know how everyone always used to switch up, to
like try and get that better audio quality?
Well, now we hold the same one for almost all of them.
This also enabled a bunch of other neat things, but Steve
might talk about those later.
The other major problem that we had was our caches.
So at YouTube, we have a bunch of caches, where
we store our videos.
This is shocking.
Basically, this is what they used to look like.
Formats A, B and C, for a video, were all
on different machines.
So when we wanted to pull one of these videos, we had to
open a new connection to a new machine, and say, hey, how
quickly can you get me those video bytes?
And the actual rate at which you can get them from these
two different machines might vary, because of, like I said
before, varying network conditions, right?
This is all about understanding what the network
is like between you and the server you're
pulling videos from.
When you start changing that in the middle of your
adaptation, it gets really, really complicated.
So we did this.
We moved everything on to one machine.
And that took months.
We had to redesign our caches from the ground up to make
this happen.
It was a very, very large project, as you
could probably imagine.
The last problem, latency.
So anyone want to guess where we launched adaptive streaming
in Flash in this?
We'll talk about that in a minute, but right in the
middle here, you'll see the blue line is Flash, red line
is HTML5, and all of a sudden, HTML5 is a
lot faster than Flash.
I'm going to hand it over to Steve to tell you about why
we're switching to HTML5 and why this is the case.
And he'll talk about HTML5.
STEVEN ROBERTSON: Thanks Matt.
So when we started doing application level streaming, a
new security mechanism kicked in.
That security mechanism is a cross origin request
restriction.
When you try and place a request to a new domain,
because all of our videos are stored on separate machines,
that request needs a security check first.
And that security check is implemented in Flash by making
an initial request to a policy file located at
the document root.
That initial request has to complete before the next
request can start.
In HTML5, that information is actually embedded straight
into the HTTP request and response, and the browser is responsible for dropping the response if the check fails.
It offers the same level of security, but
no extra round trip.
And that tiny little XML file makes the difference; it's why HTML5 is faster.
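For comparison, this is roughly what the HTML5 side looks like. It's a sketch only, with an illustrative header value rather than YouTube's real configuration.

```typescript
// Sketch only: with CORS, the permission check travels with the media request
// itself, so there's no separate policy-file round trip.
async function fetchCrossOriginMedia(url: string): Promise<ArrayBuffer> {
  const resp = await fetch(url, { mode: "cors" });
  // The video host grants access in its response headers, for example:
  //   Access-Control-Allow-Origin: https://www.youtube.com
  // If that header is missing, the browser drops the response before our code
  // ever sees the bytes -- same security guarantee, no extra round trip.
  return resp.arrayBuffer();
}
```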
So this change actually has a lot of fortuitous timing,
because for the first time, we're also able to append
media directly to the video element under application control in HTML5.
The Media Source Extensions are a new specification in the W3C; it's currently a working draft.
And they basically do the same thing that Zeri did.
They offer an application an opportunity to append data straight to a media element.
And in fact, in the first draft, they basically look
exactly like Zeri.
And we tried this out, and we realized that it wouldn't
actually work.
Well, it would work on desktop Chrome, but we needed this to
be everywhere.
We needed this to be on phones, we needed this to be
on TVs, and the API just was not as
portable as the web required.
And understanding that takes us to one of the core axioms
of engineering work at YouTube.
Which is that playing videos is hard.
Surprisingly hard.
But it's a little easier to understand given the history
of video codec development.
This is just one facet of it.
But the codecs themselves are optimized for a tremendous
number of conditions and each one has separate trade offs.
But they're all gunning for lower bit
rate, or higher fidelity.
And none of the things they're optimized for is simplicity.
In fact, the codec target for most next generation codecs,
whatever that generation is, is like, oh, we'll only double
or triple the complexity.
And we at YouTube are thrilled about this, because it does
cut down on the most important expenses, and lets us access a
lot more content and distribute it more widely.
But when you tell a bunch of very smart scientists and
engineers to make something more complicated over the span
of 40 years, they succeed.
So Adobe dealt with this originally in the Flash player
by basically drawing a map.
By codifying exactly what their platform would do under
all of these circumstances.
And that allowed developers to build applications for the
first time that would be able to do complicated mechanics,
like the things that we do in the YouTube player.
It was a tremendous boon for the web.
It allowed a lot of video applications, and video, of course, is really fundamental to all kinds of experiences.
And that was a fantastic move.
But it was a trade off, because when you build
applications with these implementation details in
mind, they become brittle.
They are hard to port.
They're hard to maintain.
And the Flash plug-in itself has a lot of portability
limitations due to restrictions from those kind
of implementation agreements.
By contrast, the W3C expects and mandates that their specs
are interoperable, that you can get multiple
implementations going.
And the append bytes model didn't really fit with that.
So, with the Chrome media team, we worked out a system
that was a little bit richer, had more of an intermediate
layer, so that you could hide those interfaces.
And so the media source spec became much more effective at
doing that.
The core feature of this is the source buffer.
The source buffer is an abstraction.
It's mostly a timeline.
And once you create a source buffer object, you can just
append media to it, after you've downloaded it by
whatever means.
And as you continue to append, up to a certain point, the media
is retained on the timeline.
That's a lot like appending media to a queue
in the normal situation.
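A minimal sketch of that flow with the Media Source API; the codec string and segment URLs are just placeholders, not YouTube's actual ones.

```typescript
// Sketch only: create a MediaSource, attach it to a <video> element, add a
// SourceBuffer, and append downloaded segments to its timeline.
const video = document.querySelector("video") as HTMLVideoElement;
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener("sourceopen", async () => {
  const buf = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E, mp4a.40.2"');

  // Append one segment and wait until the SourceBuffer has absorbed it.
  const append = (data: ArrayBuffer) =>
    new Promise<void>((resolve) => {
      buf.addEventListener("updateend", () => resolve(), { once: true });
      buf.appendBuffer(data);
    });

  // Initialization segment first, then media segments as the adaptive logic
  // fetches them; the SourceBuffer keeps the timeline consistent for us.
  await append(await (await fetch("/segments/init.mp4")).arrayBuffer());
  await append(await (await fetch("/segments/chunk-000.mp4")).arrayBuffer());
});
```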
So to understand more about why this is so important,
let's talk about an Adaptive Upswitch.
The adaptive algorithm has a single goal, which is to put
the best video quality in front of
the user at all times.
And that means that if you're watching a video, and you
suddenly detect that you have additional bandwidth, the
player is going to start downloading new video, not at
the end of the queue, but as soon as it can, for the next
adaptive checkpoint.
And normally what you would expect is you just drop that
new media in there, and it will swap out for the media
that is playing at 10 seconds on this diagram.
But underneath, the decoder might have actually started
pulling data far ahead.
On Chrome, for instance, this is like two or three frames,
which is great.
On a lot of TVs, it could be 10 seconds read ahead.
And the platform needs to find a way to communicate this
effectively to the client if you were using an append-bytes style API, where you're doing all this manual
management.
On the other hand, the source buffer abstraction is capable
of keeping track of all this for you.
So all you do is drop in the media, and source buffer takes
care of knowing when to splice, and retaining the
existing information so that the decoder is pulling from
exactly the right samples.
And then once that process is done, it'll swap in exactly
what you told it to, so that way the representation in the
browser is exactly what the application expects it to be.
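Put together, an upswitch from the application's point of view can be as simple as the following sketch, with hypothetical thresholds, names, and URLs; the SourceBuffer does the splicing underneath.

```typescript
// Sketch only: if measured bandwidth comfortably exceeds the higher quality's
// bitrate, fetch the higher-quality version of the upcoming segment and append
// it. The SourceBuffer overwrites that time range and handles the splice.
async function maybeUpswitch(
  buf: SourceBuffer,
  measuredKbps: number,
  next: { lowUrl: string; highUrl: string; highKbps: number }
): Promise<void> {
  const url = measuredKbps > next.highKbps * 1.5 ? next.highUrl : next.lowUrl;
  const data = await (await fetch(url)).arrayBuffer();
  await new Promise<void>((resolve) => {
    buf.addEventListener("updateend", () => resolve(), { once: true });
    buf.appendBuffer(data);
  });
}
```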
So that's how we're doing adaptive streaming in both
HTML5 and Flash.
A more interesting question is what's in it for you?
Why is this beneficial for our users?
And on a familiar theme, the most important
reason for us is speed.
A lot of the existing adaptive schemes may be simpler for
applications to implement, but they have
a significant penalty.
For example, one streaming protocol, in order to do this
adaptation, requires multiple separate requests before it
can get started.
Some of these requests are megabytes of things that
aren't movies.
And that cost has to be paid up front.
The media source solution allows us to embed a lot of
that logic, and cuts down the number of requests so that we
can start making media fetches at the top of the page,
because we can coordinate that with our application.
Another important feature is experiments.
YouTube's constantly running experiments.
At any given time, you can expect hundreds of experiments
are going on.
And because we see billions of playbacks a day, these experiments get data that is collected and processed very quickly.
And platform makers, browser manufacturers, just don't have
that instrumentation.
They would have to build a YouTube in order to collect
that much data.
So we can tune our algorithms a lot faster, release new
ones, and adapt to the changing network conditions
and geographic regions.
Another thing we find out from all this experimentation is
that there's a lot of region specific differences in what
people like to see.
For instance, in the United States, people would rather
watch a buffer spinner than see their video adapt down to
the very lowest qualities.
But in other parts of the world, people
would prefer the 144p.
They're like, don't even give me the 360P.
I want this to start quickly.
So you have to build these differences in a way that you
can see the responsiveness, and that requires this kind of
online continuous tuning and experimentation.
Another surprising thing we learned from this process is
that software has bugs.
Some bugs are more surprising than others.
For instance, if a popular mobile handset manufacturer
releases a software update that opens TCP connections and
doesn't close them, you get something
that looks like this.
A globally distributed denial of service attack on your
servers that lasts for months.
And there's basically nothing you can do to avoid this
situation except sit and wait and hope
they release an update.
That means that we have to make the choice between
degrading service for all users, or cutting off a large
fraction of users.
And we never want to make that choice.
With application level streaming, we can deliver a
fix or a work around in hours, instead of months.
And finally, there are a lot of new
features that this unlocks.
One that's great for our users is pre-loading the video, so
we can make sure that videos on the page start even faster, by downloading exactly how much video we need without going over.
We couldn't enable this on a global basis, because some
browsers will actually start downloading the whole video.
So having this application level control gives us the
assurance that we're doing the right thing by the user.
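A hedged sketch of what that kind of bounded preload might look like; the five-second figure and segment layout are purely illustrative, not YouTube's actual values.

```typescript
// Sketch only: preload just enough of the next video to start instantly,
// then stop, instead of letting the browser pull down the entire file.
const PRELOAD_SECONDS = 5; // illustrative, not YouTube's actual value

async function preload(
  buf: SourceBuffer,
  segments: { url: string; durationSec: number }[]
): Promise<void> {
  let loadedSec = 0;
  for (const seg of segments) {
    if (loadedSec >= PRELOAD_SECONDS) break; // downloaded exactly what we need
    const data = await (await fetch(seg.url)).arrayBuffer();
    await new Promise<void>((resolve) => {
      buf.addEventListener("updateend", () => resolve(), { once: true });
      buf.appendBuffer(data);
    });
    loadedSec += seg.durationSec;
  }
}
```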
I'm really excited about global load balancing, but
that is an internal feature, and the only effective
response that you'll see is that things will be much more
stable over time.
You won't encounter these hiccups where your video
connection suddenly gets slow because one link is congested.
So where are you able to see this? As we mentioned, in Flash it is live at 100 percent.
HTML5, the media source APIs are implemented in Chrome, and
we look forward to rolling those out.
You can join the HTML5 experiment to see how things
work today.
Or just stay put, and in the next couple of months, you'll
see this HTML5 experience shipping natively on Chrome
and then on any other browsers that pick up this API.
And we're also--
a lot of our TV and console applications are built on web
technologies.
So this same player will appear there, and handsets
will use the same streaming strategies.
So thank you for watching both this presentation and YouTube.
And if you all have any questions, we'd
love to hear them.
[AUDIENCE CLAPS]
AUDIENCE: Hi.
What is it for us like developers that need to give
our users content?
I mean, are you making any of these available for us to use
on our own websites instead of YouTube?
STEVEN ROBERTSON: The media source extensions?
The media source is a new web API.
So anybody can use it.
But what we expect to happen, and have seen some activity
around, is that content distribution networks will
actually be doing the same kind of experimentation and
produce algorithms that are tuned for their networks.
So what they'll expose to users is a simple player, a
JavaScript thing.
AUDIENCE: Mmm hmm.
STEVEN ROBERTSON: You just drop it in, and that
JavaScript takes care of all of the adaptive logic.
But of course, being a web technology, if you want to get
in there and actually implement your own stuff, more
power to you.
You definitely can.
AUDIENCE: Thank you.
MATT WARD: Of course, you're also welcome to use our video
player, which will just do this all for free.
AUDIENCE: Yeah.
AUDIENCE: Hi.
Great presentation, thanks.
STEVEN ROBERTSON: Thank you.
AUDIENCE: I'm wondering, why is there no always play in HD
option on YouTube?
There's one that is play in HD when you're in full screen,
but what happens all the time when I watch videos, I launch
the video, put it in full screen, and I've got like the
15 or 20 first seconds not in HD.
And it's like, OK.
I would always use the HD option if it was
there, but it's not.
Is there a reason for that?
MATT WARD: So I think part of what you're asking here is--
is your question more why am I not getting HD right away when
I go full screen or is it more--
AUDIENCE: So that I understood, thanks to the
presentation, it's why you do not let users actually always
have HD all the time if they want to.
MATT WARD: So you remember how the pipes are
getting really clogged?
So if we did that the pipes would be even more clogged and
nobody would be able to get through.
So we're doing the best thing here for a number of reasons.
For that reason we're conservative about how we're
using bandwidth on the internet.
In addition, we're never actually going to send you a
720P video into a little 360P screen.
This actually puts a lot more strain on your CPU and your
local processing, and will drain your battery faster.
We're thinking about things like that and trying to pick
the right quality for the size screen that you have, as well
as account for certain network conditions that
you might be seeing.
AUDIENCE: Thanks.
MATT WARD: Yep.
AUDIENCE: Hi.
I had a question about rate estimation, because in a
mobile device your radio changes very fast.
How do you guys do that, and how fast can you do it?
STEVEN ROBERTSON: Some of the more interesting strategies
for rate estimation on mobile devices involve packet pacing
to actually look at the interarrival time.
WebRTC uses this strategy.
And TCP doesn't have the-- at the web level, TCP doesn't expose APIs to measure that.
So one strategy we're exploring is to perform some
client server cooperation on TCP window clamping to get an
estimate of this.
But the short answer is it's a hard problem, and we're still
looking in to how to solve it.
It's definitely a challenge.
It comes down to being more conservative on mobile.
When you detect that you're on a mobile network, there's a
little bit more conservativeness; that's just our fallback until a better solution is presented.
AUDIENCE: Thanks.
AUDIENCE: On iOS, certainly, and also on Chrome on Android,
at least on recent versions of Android, you live stream using
HLS, which is built into mobile Safari, and I assume
it's also built into Chrome on Android.
So is that application level streaming or is that system
level streaming?
MATT WARD: That's system level streaming.
That's actually exactly what we were talking
about on that slide.
That was a picture of how HLS works.
So basically, on these devices today, we don't have access to
the APIs that we talked about in the second
half of this talk.
Those APIs don't exist.
And for that reason today we are still doing live streaming
via HLS, because we really want to hit all of the devices
on the market.
We really want you to have live streaming on your iPad
and your iPhone and your Android.
So that's really why we're doing HLS today on those
devices, and that's why you see that.
AUDIENCE: So you're going to switch over or
what is the road map?
MATT WARD: Yeah.
When the API becomes available on all these devices, we would
ideally like to see our applications having full
control over streaming during--
we want that application level control of streaming live.
We are fully capable of doing it today on the desktop, but
again, for portability reasons,
we're still using HLS.
AUDIENCE: So on iOS it would be implemented in your Chrome
implementation somehow?
MATT WARD: Yeah.
I mean, Chrome, we hope, is finishing bringing Media Source to mobile Chrome very shortly, that's my understanding.
I don't work on the Chrome team, so I can't actually say
what the timeline of that is, but yeah, it's coming.
AUDIENCE: I would like to follow up on the HD question.
On my mobile device, I totally get I don't want my battery to
run down, I don't want things to get hot.
But, say on the pixel, for example, when I've got my LTE
connection with 20 plus megabit download speeds, and
I've got a video that I want to--
I want to see the quality.
I'm that kind of person.
So I prefer to wait.
But it's coming instantly.
What protocols are in place that make it go to a lower quality setting whenever I've got plenty of bandwidth and plenty of CPU processing power?
MATT WARD: One thing you might actually have is a lot of
bandwidth to a lot of the network.
So you might have a great, great connection to wherever
you're doing your speed test to.
You might not actually have a great connection to our
servers, so we might not actually have a 20 megabit
connection to your device on an LTE network.
So the number that we're actually seeing is the one
that is informing our stream selection logic.
So from our perspective, and from a lot of our experiments,
we value smooth playbacks and fast playbacks.
We're talking, actually, there's a lot of heated
discussion about this exact thing internally right now.
And we're trying to figure out exactly how we can give more
fine-grained quality controls for people like you that really want to crank it up if they can.
And today, our best answer to that is, go full screen.
And these algorithms are going to do their best.
AUDIENCE: OK, thank you.
And I will ask you right now.
I request that there be an option for HD playback, no
matter what is going on.
MATT WARD: I will accept your request.
[AUDIENCE LAUGHS]
MATT WARD: Well, thank you guys very much for coming
today, and enjoy the rest of the activities this week.
[AUDIENCE CLAPS]