MATT WARD: So I guess we can get started.
My name's Matt Ward, and this is my
friend, Steven Robertson.
And today we're going to tell you a little bit about how
we're changing the way we stream videos at YouTube.
We're going to talk about adaptive streaming.
And as you know, we'll be talking about
the internet today.
This is a great picture that my friends over at
collegehumor.com put together.
You can go check it out if you don't know any of these memes.
A lot of them are YouTube memes, which is why I'm
showing this off.
And we're actually going to go and talk about the birthplace
from our perspective.
This is not the artist-- once the artist uploads it, this is
kind of where everything starts for us.
And that's in our Google data centers.
We're making some big changes about how we're processing
these videos that we're going to go into.
And then the next part of these videos' life cycle, if
you will, is traveling through these pipes.
And these are kind of the first pipes that these videos
go through as they leave our data centers and head towards
your computer.
It is one of the first possible places that we can
see congestion.
And there's a lot of congestion out there on the
internet today.
Now, I have to click that on this slide.
And this is really kind of what we're doing to the
internet today with all of the data that we're putting
through it.
We're really just doing this.
[AUDIENCE LAUGHS]
MATT WARD: Don't worry, it'll start.
[SPEAKING JAPANESE IN VIDEO]
MATT WARD: We're pushing as much data as possible through
these pipes right now.
And it's kind of getting crowded in there.
So when it gets crowded like that, this is
what our users see.
They get a buffer spinner.
Your network gets congested, and you're not able to pull
data down as quickly as you were before,
and so, buffer spinner.
So today we're going to talk about a four-year project at
YouTube to change the way that we actually
stream videos to users.
We're going to talk about two projects.
The first project is called Zeri.
It's a new API, specifically, that Adobe introduced to us back in 2009.
It took us a couple of years to get a working prototype,
and another year to actually ship it to all of our users.
And then we started really focusing on HTML5, and we're
going to end today talking about how we're doing the same
kind of thing in HTML5.
So let's kind of get started.
We actually refer to this project as Sliced Bread
internally.
We talk about it as chunked video playback.
The best thing since sliced bread.
And that's because before these APIs, we were delivering
video in giant loaves of bread like this.
There was nothing between the internet
and the actual plug-in.
Here, I'm going to give you a couple
flash examples to start.
And we really just used to send the video right into the
net stream, and the video would play.
That was it.
There wasn't any more complication to it.
And then we introduced the Zeri API, and now we sit right
in between the internet and the plug-in, and we really
have control over how quickly the data is
flowing through it.
And we chunk everything.
And we'll go into why that happens in a minute.
[INAUDIBLE]
MATT WARD: Exactly.
There's four reasons why.
We will go into each of them in more details, but let's
start talking about efficiency.
And like I said, chunking.
So the first piece of the pipeline that we built was
taking the video and requesting it in ranges.
So rather than just requesting the whole entire video in one
big request, kind of like putting data on as it comes
along, what we did is we said, OK, give me two megabytes.
And then I'll put that on.
Give me another two megabytes, and then I'll put that on.
And what this did was it actually reduced our
over send by 5%.
This is a really big savings for us and
great for the internet.
We actually were making more space on the internet for
other people to send bytes through, because we made this
improvement here.
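To make that concrete, here is a minimal TypeScript sketch of requesting a video in fixed-size byte ranges. The two-megabyte chunk size is the figure from the talk; the function names and stop condition are hypothetical, not YouTube's actual player code.

```typescript
// Sketch only: pull a video down in 2 MB byte-range requests instead of one
// giant request, so we only fetch data as the player actually needs it.
const CHUNK_SIZE = 2 * 1024 * 1024; // the two megabytes mentioned above

async function fetchChunk(url: string, offset: number): Promise<ArrayBuffer> {
  const resp = await fetch(url, {
    headers: { Range: `bytes=${offset}-${offset + CHUNK_SIZE - 1}` },
  });
  if (resp.status === 416) return new ArrayBuffer(0); // past the end of the file
  if (resp.status !== 206) {
    // 206 Partial Content is what a range-capable server returns.
    throw new Error(`unexpected status ${resp.status}`);
  }
  return resp.arrayBuffer();
}

async function streamInChunks(url: string, wantMore: () => boolean) {
  let offset = 0;
  while (wantMore()) {                 // e.g. "is the read-ahead buffer low?"
    const chunk = await fetchChunk(url, offset);
    if (chunk.byteLength === 0) break; // nothing left to fetch
    offset += chunk.byteLength;
    // ...hand the chunk to the player's in-memory buffer here...
  }
}
```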
So, let's go into another advantage that this gave us,
which is because we were doing this chunked request thing,
what's happening is when we go and seek around the video, we
will try and align your chunk request again so that the
browser can actually handle a layer of caching for us.
So if we seek back to, say, zero seconds, and you're all
at the end of the video, and the browser had that first two
megabytes of the video in cache, we can get access to
that really quickly.
And that's kind of the first layer of cache that we see.
And the next layer is actually a little
bit more in our control.
So the browser cache, we don't have any control over the
eviction that's going on there.
We can't actually say, do you have this two
megabyte chunk in memory?
The browser hasn't given us an API for that, and we don't see
one coming anytime soon.
So we introduce this next layer of cache
that's really in memory.
And that's really what you're seeing when you see a buffer
bar on our video.
That's the amount of data that we're actually holding in
memory inside of the Flash plug-in.
So once the data flows out of our in memory holdings, we
kind of send it up to this net stream object in Flash.
And that's where the decoder kind of lives.
And once we've passed the bytes in there, we really
don't have any control over it anymore, and the only thing we
can do is say stop.
So, let's talk a little bit more about this buffer bar.
So this is our great seek bar here.
And what we used to do, to try and prevent all that over-send we talked about a little earlier, is we actually had logic in our servers that trickled the bytes down.
Believe it or not, this logic wasn't perfect.
I know it's shocking.
But now that we are looking at how many bytes the client is
actually receiving, and we have this really accurate
measurement of how many bytes are arriving on the client,
we're able to effectively manage how much
read ahead we have.
And today we kind of cap that around 40 seconds.
And that's what we found was a good value.
When you make that value larger, you crash a lot of
Flash plug-ins.
Trust me.
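A hedged sketch of what that read-ahead cap looks like on the client; the 40-second figure is from the talk, the rest is illustrative. A check like this would play the role of the `wantMore` callback in the chunk-fetching sketch earlier.

```typescript
// Sketch only: decide whether to request more data based on how far ahead of
// the playhead our in-memory buffer already extends.
const MAX_READ_AHEAD_SECONDS = 40; // the cap mentioned above

function shouldRequestMore(currentTimeSec: number, bufferedEndSec: number): boolean {
  // bufferedEndSec = media time up to which we are holding data in memory.
  return bufferedEndSec - currentTimeSec < MAX_READ_AHEAD_SECONDS;
}
```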
So the next thing here that I've pointed out that we
changed was we had this other really advanced logic in our
servers to handle a certain kind of seeking.
So when we talk about seeking there's two kinds of seeking.
The first kind of seeking is what we
call in buffer seeking.
And that kind of looks like this.
You seek to an area that you kind of have already
downloaded and you have in memory in the browser.
And these we're able to handle by feeding the bytes we kind
of already have in memory right in there.
Now, what happens when you seek to this region, which we
don't really know anything about?
Right?
In the past, those in buffer seeks were really simple,
because the plug-in just handled them for you.
You didn't have to think about it at all.
But when you did this, the plug-in would say, I don't
know anything about that.
What can you do for me?
And we would re-request the whole entire stream again by
appending a start parameter to the actual URL, and say, hey,
I seek to a minute into the video, but I'd only downloaded
30 seconds, right, so I don't know what's going on at that minute.
So really you would say, start at 60 seconds and give me a
video stream from there.
And the servers would stitch together a new video for us.
And we changed that, and now we do all of that logic on the
client side.
And it's made our servers a lot dumber, if you will.
But our clients are really, really smart.
And it also let us play some really neat tricks on the
client that we'll talk about later.
So to do client side seeking like this, we have to gain a
really, really clear understanding of what the
container is.
So I'm going to give a little background on what videos
actually look like.
And I'm going to make a little analogy for you.
Videos are like tacos.
They look exactly like this.
So what do I mean?
OK.
So, the meat of this video, the stuff that we really,
really care about is the underlying
video and audio codecs.
Here I give a few examples of that on the left over here.
And then, the container is this kind of stuff that wraps
around the outside.
It kind of tells you something about what's inside.
It's what keeps the whole video together, if you will.
It brings all of these individual samples that we've
encoded together into this one big glorious hunk of video.
So what we did initially in our Flash project was answer this question of how do we know a byte offset for a time that the user requests?
And we kind of built this really common interface for
the whole entire thing.
You know, FLVs have this time-to-byte map.
MP4s have these things called moov atoms.
Sidx, kind of the same story.
HLS and DASH each have a manifest.
The manifest says, oh, well here's five second chunk, five
second chunk, five second chunk, and that's kind of the
granularity you get there.
So we built a kind of common interface to that from time to
byte that our application can query into to find out this
answer that's so important for seeking.
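Here is an illustrative TypeScript sketch of what such a common time-to-byte interface could look like. The interface and class names are hypothetical; each container format would supply its own implementation, whether that's an FLV time-to-byte map, an MP4 moov/sidx parser, or a manifest with segment-level granularity.

```typescript
// Sketch only: one interface the player can query, regardless of container.
interface ByteRange {
  start: number;
  end: number;
}

interface SeekIndex {
  // Given a media time in seconds, return the byte range that must be
  // requested so playback can resume at (or just before) that time.
  rangeForTime(timeSec: number): ByteRange;
}

// A manifest-style implementation, where the index only knows whole segments.
class ManifestIndex implements SeekIndex {
  constructor(private segments: { startSec: number; range: ByteRange }[]) {}

  rangeForTime(timeSec: number): ByteRange {
    // Pick the last segment whose start time is at or before the target.
    let best = this.segments[0].range;
    for (const seg of this.segments) {
      if (seg.startSec <= timeSec) best = seg.range;
    }
    return best;
  }
}
```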
The next part was for the samples that we actually hold
onto, we'll talk about this in a second, but we actually
sometimes switch the codec that we're using in the middle
of playback.
We're doing this a little less and less today, and I'll talk
about that later.
But let's first talk about what a video
really looks like.
And it looks a lot more like this than a taco.
So the videos kind of start with this thing that
initializes the decoder.
You kind of tell the decoder, this is the key frame I'm
about to feed you.
This is what--
the data you're about to see is going to look like this.
That's what we start by saying.
And then videos have key frames and progressive frames
and this simple example here uses H264 without any fancy B
frames or anything that you guys might know about.
And some basic AAC audio.
That's a lot of what we've been streaming these days.
So let's go into how we do adaptation.
One of the big things that we're able to do now is switch
between streams.
So here we have two streams.
A and B. And we detect that there's a
change in the network.
Right?
And what we want to do is this, make a switch.
So when we make a switch, that has to happen at a key frame,
because the decoder will not understand if you switch at a
progressive frame here.
How do I make this transfer from my current data to some--
each of the progressive frames kind of build on key frames.
So you must feed a key frame first.
And that's the logic that we also had to be very careful to
take care of.
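As a rough sketch of that rule, with hypothetical names: when switching to the new stream, the switch point has to be one of its keyframes, because the decoder can't start from a progressive frame.

```typescript
// Sketch only: a stream switch must land on a keyframe of the new stream.
interface FrameInfo {
  timeSec: number;
  isKeyframe: boolean;
}

function findSwitchPoint(newStream: FrameInfo[], desiredTimeSec: number): number | null {
  for (const frame of newStream) {
    if (frame.isKeyframe && frame.timeSec >= desiredTimeSec) {
      return frame.timeSec; // first keyframe at or after the desired switch time
    }
  }
  return null; // no keyframe available yet; keep playing the current stream
}
```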
So once we kind of had this understanding of all this
stuff, and really understand how to glue videos together on
the fly in our clients, we introduced this simple
splicing mechanism.
Now, this is going to look crazy, but this is what our
application got to at this point.
We were building and tearing down all of these things that
parse and handle different media containers, and cache
each of those individual files that you're pulling
separately.
And then we had this splicing thing that
switched back and forth.
And this API was really interesting and really fun to
work with, but Steve's going to talk about the implications of this later in HTML5, so you guys don't have to deal with any of this stuff anymore.
So the next part, once we have all this stuff, how
do we launch it?
Well, like I said, this took us four years from start to
finish, more or less.
And even once we finished, all of the stuff that you saw so
far, we had all of it like code complete.
We spent months finding bugs, and launching experiments, and
trying to figure out where are all the problems.
So let's talk about a couple of those problems here.
The first problem, remember how I had those nice pretty
lineups between all of the videos?
That's not really what our videos look like.
They look a lot more like this.
So if you're going to switch between stream A and stream B
here, what's going to happen is you're going to have to
switch at that key frame.
Are you going to switch in the middle of that progressive
frame on the top one?
Or are you going to drop that progressive frame?
So what would happen is you would either get black
flashes, or you'd go back in time.
Some people don't notice this, because it's very, very quick.
But this was happening.
Well, I had a worse problem.
People are more sensitive about audio.
Audio--
if you're doing this in audio, you're making pops and glitches in people's ears.
This is bad.
So for a while, actually, we were paying the price of the occasional pop and glitch in people's ears, in exchange for fewer re-buffers and smoother playbacks.
We tried to limit how many switches we would do during a
play back to minimize the number of these kind of
problems that users would see.
But what we're doing today is we're making our
videos look like this.
We're re-transcoding everything inside of YouTube
to really improve the experience.
We want all the key frames lined up so that when we
switch streams, you don't even notice any
black frames or anything.
The audio, for all of our qualities now, is the same.
When we're playing an auto-adapted playback, we hold the same exact audio stream at CD quality through the whole entire thing.
It's great.
I mean, you know how everyone always used to switch up, to
like try and get that better audio quality?
Well, now we hold the same one for almost all of them.
This also enabled a bunch of other neat things, but Steve
might talk about those later.
The other major problem that we had was our caches.
So at YouTube, we have a bunch of caches, where
we store our videos.
This is shocking.
Basically, this is what they used to look like.
Formats A, B and C, for a video, were all
on different machines.
So when we wanted to pull one of these videos, we had to
open a new connection to a new machine, and say, hey, how
quickly can you get me those video bytes?
And the actual rate at which you can get them from these
two different machines might vary, because of, like I said
before, varying network conditions, right?
This is all about understanding what the network
is like between you and the server you're
pulling videos from.
When you start changing that in the middle of your
adaptation, it gets really, really complicated.
So we did this.
We moved everything on to one machine.
And that took months.
We had to redesign our caches from the ground up to make
this happen.
It was a very, very large project, as you
could probably imagine.
The last problem, latency.
So anyone want to guess where we launched adaptive streaming
in Flash in this?
We'll talk about that in a minute, but right in the
middle here, you'll see the blue line is Flash, red line
is HTML5, and all of a sudden, HTML5 is a
lot faster than Flash.
I'm going to hand it over to Steve to tell you about why
we're switching to HTML5 and why this is the case.
And he'll talk about HTML5.
STEVEN ROBERTSON: Thanks Matt.
So when we started doing application level streaming, a
new security mechanism kicked in.
That security mechanism is a cross origin request
restriction.
When you try and place a request to a new domain,
because all of our videos are stored on separate machines,
that request needs a security check first.
And that security check is implemented in Flash by making
an initial request to a policy file located at
the document root.
That initial request has to complete before the next
request can start.
In HTML5, that information is actually embedded straight
into the HTTP request and response, and the browser is responsible for dropping the response if the check fails.
It offers the same level of security, but
no extra round trip.
And that tiny little XML file makes the difference; it's why HTML5 is faster.
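For comparison, this is roughly what the HTML5 side looks like. It's a sketch only, with an illustrative header value rather than YouTube's real configuration.

```typescript
// Sketch only: with CORS, the permission check travels with the media request
// itself, so there's no separate policy-file round trip.
async function fetchCrossOriginMedia(url: string): Promise<ArrayBuffer> {
  const resp = await fetch(url, { mode: "cors" });
  // The video host grants access in its response headers, for example:
  //   Access-Control-Allow-Origin: https://www.youtube.com
  // If that header is missing, the browser drops the response before our code
  // ever sees the bytes -- same security guarantee, no extra round trip.
  return resp.arrayBuffer();
}
```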
So this change actually has a lot of fortuitous timing,
because for the first time, we're also able to append
media directly to the video element under application control in HTML5.
The Media Source Extensions are a new specification in the W3C; it's currently a working draft.
And they basically do the same thing that Zeri did.
They offer an application an opportunity to append data straight to a media element.
And in fact, in the first draft, they basically look
exactly like Zeri.
And we tried this out, and we realized that it wouldn't
actually work.
Well, it would work on desktop Chrome, but we needed this to
be everywhere.
We needed this to be on phones, we needed this to be
on TVs, and the API just was not as
portable as the web required.
And understanding that takes us to one of the core axioms
of engineering work at YouTube.
Which is that playing videos is hard.
Surprisingly hard.
But it's a little easier to understand given the history
of video codec development.
This is just one facet of it.
But the codecs themselves are optimized for a tremendous
number of conditions and each one has separate trade offs.
But they're all gunning for lower bit
rate, or higher fidelity.
And none of the things they're optimized for is simplicity.
In fact, the codec target for most next generation codecs,
whatever that generation is, is like, oh, we'll only double
or triple the complexity.
And we at YouTube are thrilled about this, because it does
cut down on the most important expenses, and lets us access a
lot more content and distribute it more widely.
But when you tell a bunch of very smart scientists and
engineers to make something more complicated over the span
of 40 years, they succeed.
So Adobe dealt with this originally in the Flash player
by basically drawing a map.
By codifying exactly what their platform would do under
all of these circumstances.
And that allowed developers to build applications for the
first time that would be able to do complicated mechanics,
like the things that we do in the YouTube player.
It was a tremendous boon for the web.
It allowed a lot of video applications, and video, of course, is really fundamental to all kinds of experiences.
And that was a fantastic move.
But it was a trade off, because when you build
applications with these implementation details in
mind, they become brittle.
They are hard to port.
They're hard to maintain.
And the Flash plug-in itself has a lot of portability
limitations due to restrictions from those kind
of implementation agreements.
By contrast, the W3C expects and mandates that their specs
are interoperable, that you can get multiple
implementations going.
And the append bytes model didn't really fit with that.
So, with the Chrome media team, we worked out a system
that was a little bit richer, had more of an intermediate
layer, so that you could hide those interfaces.
And so the media source spec became much more effective at
doing that.
The core feature of this is the source buffer.
The source buffer is an abstraction.
It's mostly a timeline.
And once you create a source buffer object, you can just
append media to it, after you've downloaded it by
whatever means.
And as you continue to append, up to a certain point, the media
is retained on the timeline.
That's a lot like appending media to a queue
in the normal situation.
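A minimal sketch of that flow with the Media Source API; the codec string and segment URLs are just placeholders, not YouTube's actual ones.

```typescript
// Sketch only: create a MediaSource, attach it to a <video> element, add a
// SourceBuffer, and append downloaded segments to its timeline.
const video = document.querySelector("video") as HTMLVideoElement;
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener("sourceopen", async () => {
  const buf = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E, mp4a.40.2"');

  // Append one segment and wait until the SourceBuffer has absorbed it.
  const append = (data: ArrayBuffer) =>
    new Promise<void>((resolve) => {
      buf.addEventListener("updateend", () => resolve(), { once: true });
      buf.appendBuffer(data);
    });

  // Initialization segment first, then media segments as the adaptive logic
  // fetches them; the SourceBuffer keeps the timeline consistent for us.
  await append(await (await fetch("/segments/init.mp4")).arrayBuffer());
  await append(await (await fetch("/segments/chunk-000.mp4")).arrayBuffer());
});
```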
So to understand more about why this is so important,
let's talk about an Adaptive Upswitch.
The adaptive algorithm has a single goal, which is to put
the best video quality in front of
the user at all times.
And that means that if you're watching a video, and you
suddenly detect that you have additional bandwidth, the
player is going to start downloading new video, not at
the end of the queue, but as soon as it can, for the next
adaptive checkpoint.
And normally what you would expect is you just drop that
new media in there, and it will swap out for the media
that is playing at 10 seconds on this diagram.
But underneath, the decoder might have actually started
pulling data far ahead.
On Chrome, for instance, this is like two or three frames,
which is great.
On a lot of TVs, it could be 10 seconds read ahead.
And the platform needs to find a way to communicate this
effectively to the client if you were using an append-bytes style API, where you're doing all this manual
management.
On the other hand, the source buffer abstraction is capable
of keeping track of all this for you.
So all you do is drop in the media, and source buffer takes
care of knowing when to splice, and retaining the
existing information so that the decoder is pulling from
exactly the right samples.
And then once that process is done, it'll swap in exactly
what you told it to, so that way the representation in the
browser is exactly what the application expects it to be.
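Put together, an upswitch from the application's point of view can be as simple as the following sketch, with hypothetical thresholds, names, and URLs; the SourceBuffer does the splicing underneath.

```typescript
// Sketch only: if measured bandwidth comfortably exceeds the higher quality's
// bitrate, fetch the higher-quality version of the upcoming segment and append
// it. The SourceBuffer overwrites that time range and handles the splice.
async function maybeUpswitch(
  buf: SourceBuffer,
  measuredKbps: number,
  next: { lowUrl: string; highUrl: string; highKbps: number }
): Promise<void> {
  const url = measuredKbps > next.highKbps * 1.5 ? next.highUrl : next.lowUrl;
  const data = await (await fetch(url)).arrayBuffer();
  await new Promise<void>((resolve) => {
    buf.addEventListener("updateend", () => resolve(), { once: true });
    buf.appendBuffer(data);
  });
}
```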
So that's how we're doing adaptive streaming in both
HTML5 and Flash.
A more interesting question is what's in it for you?
Why is this beneficial for our users?
And on a familiar theme, the most important
reason for us is speed.
A lot of the existing adaptive schemes may be simpler for
applications to implement, but they have
a significant penalty.
For example, one streaming protocol, in order to do this
adaptation, requires multiple separate requests before it
can get started.
Some of these requests are megabytes of things that
aren't movies.
And that cost has to be paid up front.
The media source solution allows us to embed a lot of
that logic, and cuts down the number of requests so that we
can start making media fetches at the top of the page,
because we can coordinate that with our application.
Another important feature is experiments.
YouTube's constantly running experiments.
At any given time, you can expect hundreds of experiments
are going on.
And because we see billions of playbacks a day, these experiments get data that is collected and processed very quickly.
And platform makers, browser manufacturers, just don't have
that instrumentation.
They would have to build a YouTube in order to collect
that much data.
So we can tune our algorithms a lot faster, release new
ones, and adapt to the changing network conditions
and geographic regions.
Another thing we find out from all this experimentation is
that there's a lot of region specific differences in what
people like to see.
For instance, in the United States, people would rather
watch a buffer spinner than see their video adapt down to
the very lowest qualities.
But in other parts of the world, people
would prefer the 144p.
They're like, don't even give me the 360P.
I want this to start quickly.
So you have to build these differences in a way that you
can see the responsiveness, and that requires this kind of
online continuous tuning and experimentation.
Another surprising thing we learned from this process is
that software has bugs.
Some bugs are more surprising than others.
For instance, if a popular mobile handset manufacturer
releases a software update that opens TCP connections and
doesn't close them, you get something
that looks like this.
A globally distributed denial of service attack on your
servers that lasts for months.
And there's basically nothing you can do to avoid this
situation except sit and wait and hope
they release an update.
That means that we have to make the choice between
degrading service for all users, or cutting off a large
fraction of users.
And we never want to make that choice.
With application level streaming, we can deliver a
fix or a work around in hours, instead of months.
And finally, there are a lot of new
features that this unlocks.
One that's great for our users is pre-loading the video, so
we can make sure that videos on the page start even faster, by downloading exactly how much video we need without going over.
We couldn't enable this on a global basis, because some
browsers will actually start downloading the whole video.
So having this application level control gives us the
assurance that we're doing the right thing by the user.
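A hedged sketch of what that kind of bounded preload might look like; the five-second figure and segment layout are purely illustrative, not YouTube's actual values.

```typescript
// Sketch only: preload just enough of the next video to start instantly,
// then stop, instead of letting the browser pull down the entire file.
const PRELOAD_SECONDS = 5; // illustrative, not YouTube's actual value

async function preload(
  buf: SourceBuffer,
  segments: { url: string; durationSec: number }[]
): Promise<void> {
  let loadedSec = 0;
  for (const seg of segments) {
    if (loadedSec >= PRELOAD_SECONDS) break; // downloaded exactly what we need
    const data = await (await fetch(seg.url)).arrayBuffer();
    await new Promise<void>((resolve) => {
      buf.addEventListener("updateend", () => resolve(), { once: true });
      buf.appendBuffer(data);
    });
    loadedSec += seg.durationSec;
  }
}
```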
I'm really excited about global load balancing, but
that is an internal feature, and the only effective
response that you'll see is that things will be much more
stable over time.
You won't encounter these hiccups where your video
connection suddenly gets slow because one link is congested.
So where are you able to see this? As we mentioned, in Flash it is live at 100 percent.
HTML5, the media source APIs are implemented in Chrome, and
we look forward to rolling those out.
You can join the HTML5 experiment to see how things
work today.
Or just stay put, and in the next couple of months, you'll
see this HTML5 experience shipping natively on Chrome
and then on any other browsers that pick up this API.
And we're also--
a lot of our TV and console applications are built on web
technologies.
So this same player will appear there, and handsets
will use the same streaming strategies.
So thank you for watching both this presentation and YouTube.
And if you all have any questions, we'd
love to hear them.
[AUDIENCE CLAPS]
AUDIENCE: Hi.
What is it for us like developers that need to give
our users content?
I mean, are you making any of these available for us to use
on our own websites instead of YouTube?
STEVEN ROBERTSON: The media source extensions?
The media source is a new web API.
So anybody can use it.
But what we expect to happen, and have seen some activity
around, is that content distribution networks will
actually be doing the same kind of experimentation and
produce algorithms that are tuned for their networks.
So what they'll expose to users is a simple player, a
JavaScript thing.
AUDIENCE: Mmm hmm.
STEVEN ROBERTSON: You just drop it in, and that
JavaScript takes care of all of the adaptive logic.
But of course, being a web technology, if you want to get
in there and actually implement your own stuff, more
power to you.
You definitely can.
AUDIENCE: Thank you.
MATT WARD: Of course, you're also welcome to use our video
player, which will just do this all for free.
AUDIENCE: Yeah.
AUDIENCE: Hi.
Great presentation, thanks.
STEVEN ROBERTSON: Thank you.
AUDIENCE: I'm wondering, why is there no always play in HD
option on YouTube?
There's one that is play in HD when you're in full screen,
but what happens all the time when I watch videos, I launch
the video, put it in full screen, and I've got like the
15 or 20 first seconds not in HD.
And it's like, OK.
I would always use the HD option if it was
there, but it's not.
Is there a reason for that?
MATT WARD: So I think part of what you're asking here is--
is your question more why am I not getting HD right away when
I go full screen or is it more--
AUDIENCE: So that I understood, thanks to the
presentation, it's why you do not let users actually always
have HD all the time if they want to.
MATT WARD: So you remember how the pipes are
getting really clogged?
So if we did that the pipes would be even more clogged and
nobody would be able to get through.
So we're doing the best thing here for a number of reasons.
For that reason we're conservative about how we're
using bandwidth on the internet.
In addition, we're never actually going to send you a
720P video into a little 360P screen.
This actually puts a lot more strain on your CPU and your
local processing, and will drain your battery faster.
We're thinking about things like that and trying to pick
the right quality for the size screen that you have, as well
as account for certain network conditions that
you might be seeing.
AUDIENCE: Thanks.
MATT WARD: Yep.
AUDIENCE: Hi.
I had a question about rate estimation, because in a
mobile device your radio changes very fast.
How do you guys do that, and how fast can you do it?
STEVEN ROBERTSON: Some of the more interesting strategies
for rate estimation on mobile devices involve packet pacing
to actually look at the interarrival time.
WebRTC uses this strategy.
And TCP doesn't have the-- at the web level, TCP doesn't expose APIs to measure that.
So one strategy we're exploring is to perform some
client server cooperation on TCP window clamping to get an
estimate of this.
But the short answer is it's a hard problem, and we're still
looking in to how to solve it.
It's definitely a challenge.
It comes down to being more conservative on mobile.
When you detect that you're on a mobile network, there's a
little bit more conservativeness; that's just our fallback until a better solution is presented.
AUDIENCE: Thanks.
AUDIENCE: On iOS, certainly, and also on Chrome on Android,
at least on recent versions of Android, you live stream using
HLS, which is built into mobile Safari, and I assume
it's also built into Chrome on Android.
So is that application level streaming or is that system
level streaming?
MATT WARD: That's system level streaming.
That's actually exactly what we were talking
about on that slide.
That was a picture of how HLS works.
So basically, on these devices today, we don't have access to
the APIs that we talked about in the second
half of this talk.
Those APIs don't exist.
And for that reason today we are still doing live streaming
via HLS, because we really want to hit all of the devices
on the market.
We really want you to have live streaming on your iPad
and your iPhone and your Android.
So that's really why we're doing HLS today on those
devices, and that's why you see that.
AUDIENCE: So you're going to switch over or
what is the road map?
MATT WARD: Yeah.
When the API becomes available on all these devices, we would
ideally like to see our applications having full
control over streaming during--
we want that application level control of streaming live.
We are fully capable of doing it today on the desktop, but
again, for portability reasons,
we're still using HLS.
AUDIENCE: So on iOS it would be implemented in your Chrome
implementation somehow?
MATT WARD: Yeah.
I mean, Chrome, we hope, is finishing bringing Media Source to mobile Chrome very shortly, that's my understanding.
I don't work on the Chrome team, so I can't actually say
what the timeline of that is, but yeah, it's coming.
AUDIENCE: I would like to follow up on the HD question.
On my mobile device, I totally get I don't want my battery to
run down, I don't want things to get hot.
But, say on the pixel, for example, when I've got my LTE
connection with 20 plus megabit download speeds, and
I've got a video that I want to--
I want to see the quality.
I'm that kind of person.
So I prefer to wait.
But it's coming instantly.
What protocols are in place that make it go to a lower quality setting whenever I've got plenty of bandwidth and plenty of CPU processing power?
MATT WARD: One thing you might actually have is a lot of
bandwidth to a lot of the network.
So you might have a great, great connection to wherever
you're doing your speed test to.
You might not actually have a great connection to our
servers, so we might not actually have a 20 megabit
connection to your device on an LTE network.
So the number that we're actually seeing is the one
that is informing our stream selection logic.
So from our perspective, and from a lot of our experiments,
we value smooth playbacks and fast playbacks.
We're talking, actually, there's a lot of heated
discussion about this exact thing internally right now.
And we're trying to figure out exactly how we can give more
fine-grained quality controls for people like you that really want to crank it up if they can.
And today, our best answer to that is, go full screen.
And these algorithms are going to do their best.
AUDIENCE: OK, thank you.
And I will ask you right now.
I request that there be an option for HD playback, no
matter what is going on.
MATT WARD: I will accept your request.
[AUDIENCE LAUGHS]
MATT WARD: Well, thank you guys very much for coming
today, and enjoy the rest of the activities this week.
[AUDIENCE CLAPS]