I said pretty much everything I wanna say about sorting at this point, but I do wanna cover one more related topic, namely the selection problem. This is the problem of computing order statistics of an array, with computing the median of an array being a special case. Analogous to our coverage of Quick Sort, the goal is going to be the design and analysis of a super-practical randomized algorithm that solves the problem. And this time we'll even achieve an expected running time that is linear in the length of the input array, that is, big-O of n for an input array of length n, as opposed to the O(n log n) time that we had for the expected running time of Quick Sort. Like Quick Sort's, our analysis is also going to be quite elegant. In addition, these two required videos on this very practical algorithm will motivate two optional videos on very cool topics of a similar, more theoretical nature. The first optional video is going to be on how you solve the selection problem in deterministic linear time, that is, without using randomization. And the second optional video will be a sorting lower bound: that is, why no comparison-based sort can be better than Merge Sort, can have a better running time than n log n. So a few words
about what you should have fresh in your mind before you watch this video. I'm definitely assuming that you've watched the Quick Sort videos, and not just watched them, but that you have that material pretty fresh in your mind. In particular, remember the Quick Sort video about the partition subroutine: this is where you take an input array, you choose a pivot, and, by repeated swaps, you rearrange the array so that everything less than the pivot is to the left of it and everything bigger than the pivot is to the right of it. You should remember that subroutine. You should also remember the previous discussion about pivot choices, the idea that the quality of a pivot depends on how balanced a split into two different subproblems it gives you. Those are both going to be important. For the analysis of this randomized linear-time selection algorithm, I need you to remember the concepts from probability review part one, in particular random variables, their expectation, and linearity of expectation. That said, let's move on and formally define what the selection problem is. The input is the
same as for the sorting problem: you're given an array of n distinct entries. But in addition, you're told what order statistic you're looking for. That's going to be a number i, an integer between 1 and n. The goal is to output just a single number, namely the ith order statistic, that is, the ith smallest entry in this input array. So just to be clear, suppose we have an array of, let's say, four elements, containing the numbers 10, 8, 2, and 4, and you were looking for, say, the third order statistic; that would be this 8. The first order statistic is just the minimum element of the array; that's easy to find with a linear scan. The nth order statistic is just the maximum; again, easy to find with a linear scan. The middle element is the median, and you should think of that as the canonical version of the selection problem. Now, when n is odd, it's obvious what the median is: that's just the middle element, the ((n+1)/2)th order statistic. If the array has even length, there are two possible medians, so let's just take the smaller of them; that's the (n/2)th order statistic. You might wonder why you'd ever want to compute the
median of an array rather than the mean, that is, the average. It's easy to see that you can compute the average with a simple linear scan, while the median you cannot. One motivation is that the median is a more robust version of the mean: if you have a data entry problem that corrupts one element of an input array, it can totally screw up the average value of the array, but it generally has very little impact on the median. A final comment about the problem: I'm going to assume that the array entries are distinct, that is, there are no repeated elements. But just like in our discussions of sorting, this is not a big assumption. I encourage you to think about how to adapt these algorithms to work even if the arrays do have duplicates; you can indeed still get the same very practical, very fast algorithms with duplicate elements. Now, if you think about it, we already have a
pretty darn good algorithm that solves the selection problem. Here's the algorithm; it's two simple steps and it runs in O(n log n) time. Step one: sort the input array. We have various subroutines to do that; let's say we pick Merge Sort. Now, what is it we're trying to do? We're trying to find the ith smallest element in the input array. Well, once we've sorted it, we certainly know where the ith smallest element is: it's in the ith position of the sorted array. So that's pretty cool. We've just done what a computer scientist would call a reduction, and that's a super-useful and super-fundamental concept. It's when you realize that you can solve one problem by reducing it to another problem that you already know how to solve. So what we just showed is that the selection problem reduces easily to the sorting problem. We already know how to solve the sorting problem in n log n time, so that gives us an n log n time solution to the selection problem. But, again, remember the mantra of any algorithm designer worth their salt: can we do better? We should avoid contentment. Just because we got n log n doesn't mean we can stop there; maybe we can be even faster. Now, certainly we're gonna have to look at all the elements of the input array in the worst case, so we shouldn't expect to do better than linear. But hey, why not linear time? Actually, if
you think about it, we probably should have asked that question back when we were studying the sorting problem. Why were we so content with the n log n time bound for Merge Sort, and the O(n log n) time-on-average bound for Quick Sort? Well, it turns out we have a really good reason to be happy with our n log n upper bounds for the sorting problem. It turns out, and this is not obvious and will be the subject of an optional video, that you actually can't sort an input array of length n better than n log n time, either in the worst case or on average. So, in other words, if we insist on solving the selection problem via a reduction to the sorting problem, then we're stuck with this n log n time bound. Okay, strictly speaking that's for something called comparison sorts; see the optional video for more details. But the upshot is: if we want a general-purpose algorithm and we wanna do better than n log n for selection, we have to do it using ingenuity beyond this reduction. We have to prove that selection is a strictly easier problem than sorting. That's the only way we're gonna have an algorithm that beats n log n; it's the only way we can conceivably get a linear-time algorithm. And that is exactly what is up next on our plates. We're going to show selection is indeed fundamentally easier than sorting: we can have a linear-time algorithm for it, even though we can't have a linear-time algorithm for sorting. You can think of the algorithm we're gonna discuss as a modification of Quick Sort, and in the same spirit as Quick Sort, it will be a randomized algorithm, and the running time will be an expected running
time that will hold for any input array. Now, for the sorting problem, we know that Quick Sort runs in n log n time on average, where the average is over the coin flips done by the code. But we also know that, if we wanted to, we could get a sorting algorithm with n log n time that doesn't use randomization; the Merge Sort algorithm is one such solution. So, given that we have a linear-time solution for selection, for finding order statistics, that uses randomization, it's natural to wonder: is there an analog of Merge Sort? Is there an algorithm which does not use randomization and gets this exact same linear running time? In fact, there is. The algorithm's a little more complicated, and therefore not quite as practical as this randomized algorithm, but it's still very cool; it's a really fun algorithm to learn and to teach. So I will have an optional video about linear-time selection without randomization. For those of you who aren't gonna watch that video but wanna know the key idea: the idea is to choose the pivots deterministically, in a very careful way, using a trick called the median of medians. That's all I'm gonna say about it now; you should watch the optional video if you want more details. I do feel compelled to warn you that if you're going to actually implement a selection algorithm, you should use the one we discuss in this video and not the deterministic linear-time one, because the one we'll discuss in this video has both smaller constants and works in place. So, what I want to do next
is develop the idea that we can modify the Quick Sort paradigm in order to directly solve the selection problem. To get an idea of how that works, let me review the partition subroutine, like in Quick Sort. This subroutine will be our workhorse for the selection algorithm. What the partition subroutine does is take as input some jumbled-up array and solve a problem which is much more modest than sorting. In partitioning, it's going to first choose a pivot element, somehow; we'll have to discuss what's a good strategy for choosing a pivot element. But suppose, in this particular input array, it chooses the first element, this 3, as the pivot element. The responsibility of the partition subroutine, then, is to rearrange the elements in this array so that the following properties are satisfied. Anything less than the pivot is to the left of it; those elements can be in jumbled order, but if you're less than the pivot, you'd better be to the left, like this 2 and 1, which are less than the 3. If you're bigger than the pivot, then again you can be in jumbled order amongst those elements, but all of them have to be to the right of the pivot; that's true for the numbers 4 through 8, which all are to the right of the pivot 3, in jumbled order. So this in particular puts the pivot in its rightful position, where it belongs in the final sorted array, and, at least for Quick Sort, it enabled us to recursively sort two smaller subproblems. So this is where I want you to think a little bit about how we should adapt this paradigm. So suppose I told you that the first
step of our selection algorithm is going to be to choose a pivot and partition the array. Now the question is: how are we going to recurse? We need to understand how to find the ith order statistic of the original input array; it suffices to recurse on just one subproblem, of smaller size, and find a suitable order statistic in it. So how should we do that? Let me ask you that with a very concrete example of what pivot we'd choose and what order statistic we're looking for, and see what you think. So, the correct answer to this quiz is the second answer: we can get away with recursing just once, and in this particular example we're going to recurse on the right side of the array. And instead of looking for the fifth order statistic, like we were originally, we're going to recursively search for the second order statistic. So why is that? Well, first, why do we recurse on the right side of the array? By assumption, we have an array with ten elements; we choose the pivot, we do partitioning. Remember, the pivot winds up in its rightful position; that's what partitioning does. So if the pivot winds up in the third position, we know it's the third smallest element in the array. Now, that's not what we were looking for. We were looking for the fifth smallest element in the array, which of course is bigger than the third smallest element of the array. So, after partitioning, where is the fifth smallest element gonna be? It's gotta be to the right of this third smallest element, to the right of the pivot. We know for sure that the fifth order statistic of the original array lies to the right of the pivot; that is guaranteed. So we know to recurse on the right-hand side. Now, what are we looking for? We are no longer looking for the fifth order statistic, the fifth smallest element. Why? Well, we've thrown out both the pivot and everything smaller than it. Remember, we're only recursing on the right-hand side, so we've thrown out the pivot, the third smallest element, and everything less than it: the minimum and the second minimum. Having deleted the three smallest elements, and originally looking for the fifth smallest, of what remains, of what we're recursing on, we're looking for the second smallest element.
So, the selection algorithm, in general, is just the generalization of this idea to arbitrary arrays and arbitrary situations, whether the pivot comes back equal to, less than, or bigger than the element you're looking for. So let me be more precise. I'm gonna call this algorithm RSelect, for randomized selection. According to the problem definition, it takes as input, as usual, an array A of some length n, but then also the order statistic that we're looking for; we're gonna call that i, and of course we assume that i is some integer between 1 and n, inclusive. For the base case, if the array has size one, then the only order statistic we could be looking for is the first, and we just return the sole element of the array. Now we have to partition the array around a pivot element. And just like in Quick Sort, we're going to be very lazy about choosing the pivot: we're gonna choose it uniformly at random from the n possibilities and hope things work out. That'll be the crux of the analysis, proving that random pivots are good enough sufficiently often. Having chosen the pivot, we now just invoke the standard partitioning subroutine. As usual, that's gonna give us the partitioned array: you'll have the pivot element, everything less than the pivot to the left, and everything bigger than the pivot to the right. As usual, I'll call everything to the left of the pivot the first part of the partitioned array, and everything bigger the second part. Now we have a couple of cases, depending on whether the pivot is bigger or less than the element we're looking for, so I need some notation to talk about that. Let's let j be the order statistic that the pivot p is. So if p winds up being the third smallest element, like in the quiz, then j is gonna be equal to three. Equivalently, we can think of j as the position of the pivot in the partitioned version of the array. Now there's one case which is
very unlikely to occur, but we should include it just for completeness. If we're really lucky, then our random pivot just happens to be the order statistic we were looking for; that's when i equals j. We're looking for the ith smallest element, and if by dumb luck the pivot winds up being the ith smallest element, we're done: we can just return it, and we don't have to recurse. Now in general, of course, we don't randomly choose the element we're looking for; we choose something that could be bigger or could be smaller than it. In the quiz, we chose a pivot that was smaller than what we're looking for; actually, that's the harder case. So let's first start with the case where the pivot winds up being bigger than the element we're looking for. That means that j is bigger than i: we're looking for the ith smallest, and we randomly chose the jth smallest, for j bigger than i. So this is the opposite case from the quiz. This is where we know that what we're looking for has to be to the left of the pivot: the pivot's the jth smallest, everything less than it is to the left, and we're looking for the ith smallest with i less than j, so that's gotta be on the left. That's where we recurse. Moreover, it's clear we're looking for exactly the same order statistic: if we're looking for the third smallest element, we're only throwing out stuff which is bigger than something even bigger than the third smallest element, so we're still looking for the third smallest of what remains. And naturally, the new array size is j minus one, because that's what's to the left of the pivot. The final case is when the random pivot we choose is less than what we're looking for, and then we're just like the quiz. Namely, what we're looking for is bigger than the pivot, so it's got to be on the right-hand side, and we know we've got to recurse on the right-hand side. The right-hand side has n minus j elements: we threw out everything up to and including the pivot, that's j things, so n minus j are left. And all of the j things we threw out are less than what we're looking for, so where we used to be looking for the ith smallest element, now we're looking for the (i minus j)th smallest element. So that is the whole algorithm; that is how we take the approach we used for the sorting problem in Quick Sort and adapt it to the problem of selection. So, is this algorithm any good? Let's start studying
its properties and understand how well it works. Let's begin with correctness. The claim is that no matter how the algorithm's coin flips come up, no matter what random pivots we choose, the algorithm is correct, in the sense that it's guaranteed to output the ith order statistic. The proof is by induction; it proceeds very similarly to the one for Quick Sort, so I'm not gonna give it here. If you're curious about how these proofs go, there's an optional video about the correctness of Quick Sort. If you watch that and understand it, it should be clear how to adapt that inductive argument to apply to this selection algorithm as well.
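Putting the pieces together, the algorithm just described can be sketched in Python as follows. For clarity, this sketch builds the two parts of the partition with list comprehensions rather than the in-place partition subroutine, so it shows the RSelect logic but not the in-place implementation the video recommends.

```python
import random

def rselect(a, i):
    """Return the ith smallest (1-indexed) of a list of distinct numbers,
    using uniformly random pivots -- a sketch of the RSelect logic."""
    if len(a) == 1:                      # base case: size-one array
        return a[0]
    pivot = random.choice(a)             # pivot chosen uniformly at random
    left = [x for x in a if x < pivot]   # first part of the partition
    right = [x for x in a if x > pivot]  # second part
    j = len(left) + 1                    # the pivot is the jth smallest
    if i == j:                           # lucky case: pivot is the answer
        return pivot
    elif i < j:                          # answer lies left of the pivot
        return rselect(left, i)
    else:                                # answer lies right: adjust i by j
        return rselect(right, i - j)

rselect([10, 8, 2, 4], 3)  # the third order statistic of the earlier example: 8
```

Note that whichever pivots the coin flips produce, the output is the same; only the running time varies, which is exactly the correctness claim above.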
So, as usual for divide-and-conquer algorithms, the interesting part is not so much understanding why the algorithm works, but rather understanding how fast it runs. So the big question is: what is the running time of this selection algorithm? To understand this, we have to understand the ramifications of pivot choices on the running time. You've seen the Quick Sort videos, and they're fresh in your mind, so it should be clear that, just like in Quick Sort, how fast this algorithm runs is going to depend on how good the pivots are, and good pivots means pivots that guarantee a balanced split. So in the next quiz, we'll make sure that you understand this point, and ask you to think about just how bad this selection algorithm could be if you get extremely unlucky in your pivot choices.
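As a way to check your answer to the quiz, here is a small sketch that counts the partitioning work when we search for the median but the pivot comes up as the minimum every single time. Charging roughly one unit per element scanned per call is a simplifying cost model of my own.

```python
def worst_case_work(n):
    """Total partitioning cost when searching for the median of n
    elements and every pivot is the minimum: each recursive call peels
    off just one element, at a cost of about the current array length."""
    work, size, target = 0, n, (n + 1) // 2   # target = the median's rank
    while size > 1 and target > 1:
        work += size       # one linear partition pass over the array
        size -= 1          # only the minimum was removed
        target -= 1        # the removed element was below our target
    return work

# Doubling n roughly quadruples the work -- quadratic growth:
ratio = worst_case_work(2000) / worst_case_work(1000)
```

There are about n/2 calls, each on an array of size at least n/2, for a total on the order of n squared.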
So the correct answer to this question is exactly the same as the answer for Quick Sort: the worst-case running time, if the pivots are chosen in a really unlucky way, is actually quadratic in the input length. Remember, we're shooting for linear time, so quadratic is a total disaster. How could this happen? Well, suppose you're looking for the median, and suppose you choose the minimum element as the pivot every single time. If every time you choose the pivot to be the minimum, then, just like in Quick Sort, every time you recurse, all you succeed in doing is peeling off a single element from the input array. Now, you're not going to find the median element until you've done roughly n over 2 recursive calls, each on an array that has size at least a constant fraction of the original one. So that's a linear number of recursive calls, each on an array of size at least some constant times n, and that gives you a total running time of quadratic overall. Of course, this is an absurdly unlikely event; frankly, your computer is more likely to be struck by a meteor than it is for the pivot to be chosen as the minimum element in every recursive call. But if you really had an absolutely worst-case choice of pivots, it would give this quadratic running-time bound. So the upshot is that the running time of this randomized selection algorithm depends on how good our pivots are, and for a worst-case choice of pivots, the running time can be as large as n squared. Now, hopefully, most of the time we're gonna have much better pivots, and the analysis will proceed by making that idea precise. So the key to a fast running time
is going to be the usual property that we want to see in divide-and-conquer algorithms: namely, every time we recurse, the problem size had better not just be smaller, it had better be smaller by a significant factor. How would that happen in this selection approach based on the partition subroutine? Well, if both of the subproblems are not too big, then we're guaranteed that when we recurse, we make a lot of progress. So let's think about what the best possible pivot would be, in the sense of giving a balanced split. Of course, in some sense the best pivot is just the order statistic you're looking for, since then you're done in constant time, but that's extremely unlikely and not worth worrying about, so ignore the possibility that we guess the answer as the pivot. What's the best pivot if we want to guarantee an aggressive decrease in the input size before the next recursive call? Well, the best pivot is the one that gives us as balanced a split as possible. And what's the pivot that gives us the most balanced split, a 50/50 split? If you think about it, it's exactly the median. Of course, this is not super helpful, because the median might well be what we're looking for in the first place, so this is, sort of, a circular idea. But for intuition, it's still worth exploring what kind of running time we would get in this best case: if we're not going to get linear time even in this magical best case, we certainly wouldn't expect to get it on average over random choices of the pivots. So what would happen if we
actually did luckily choose the median as the pivot every single time? Well, we'd get a recurrence for the running time that the algorithm requires on an array of length n. There's only gonna be one recursive call; this is the big difference from Quick Sort, where we had to recurse on both sides and had two recursive calls. Here, we're gonna have only one recursive call, and in the magical case where our pivots are always equal to the median, both subproblems are only half as large as the original one. So when we recurse, we get a subproblem whose size is guaranteed to be at most n over 2. And outside the recursive call, pretty much all we do is one invocation of partition, and we know that that's linear time. So the recurrence we get is: T(n) is at most T(n/2) plus big-O of n. This is totally ready to get plugged into the master method; it winds up being case two of the master method, and indeed we get exactly what we wanted: linear time. To reiterate, this is not interesting in its own right; this is just for intuition, a sanity check that at least for a best-case choice of pivots we'd get what we want, a linear-time algorithm, and we do. Now the question is: how well do we do with random pivots? The intuition, the hope, is exactly as it was for Quick Sort, namely that random pivots are a perfectly good surrogate for the median, the perfect pivot. And having the analysis of Quick Sort under our belt, where indeed random pivots did approximate very closely the performance of best-case pivots, we now have reason to believe this is hopefully true. That said, as a mathematical statement, this is totally not obvious, and it's going to take proof; that's the subject of the next
video. But let me just be clear about exactly what we're claiming. Here is the running-time guarantee that randomized selection provides: for an arbitrary input array of length n, the average running time of this randomized selection algorithm is linear, big-O of n. Let me reiterate a couple of points I made for the analogous guarantee for the Quick Sort algorithm. The first is that we're making no assumptions about the data whatsoever; in particular, we're not assuming that the data is random. This guarantee holds no matter what input array you feed into this randomized algorithm; in that sense, this is a totally general-purpose subroutine. So where, then, does this averaging come from? Where does the expectation come from? The randomness is not in the data; rather, the randomness is in the code, and we put it there ourselves. Now let's proceed to the analysis.
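Before the formal analysis, the guarantee can at least be spot-checked empirically. The sketch below re-implements the selection logic iteratively and counts the comparisons made against the pivot during partitioning; the counting instrumentation, loop structure, and cost model are my own, for illustration. Averaged over several runs, the count should come out as a small constant times n, nowhere near the quadratic worst case.

```python
import random

def rselect_count(a, i):
    """Select the ith smallest (1-indexed) of distinct values while
    counting comparisons against the pivot -- an iterative sketch."""
    comparisons = 0
    while len(a) > 1:
        pivot = random.choice(a)             # random pivot, as in RSelect
        comparisons += len(a) - 1            # one partition pass over the array
        left = [x for x in a if x < pivot]
        j = len(left) + 1                    # the pivot's order statistic
        if i == j:
            return pivot, comparisons        # lucky case: pivot is the answer
        if i < j:
            a = left                         # continue on the left part
        else:
            a = [x for x in a if x > pivot]  # or on the right, adjusting i
            i -= j
    return a[0], comparisons

random.seed(1)
n, trials = 10000, 20
avg = sum(rselect_count(list(range(n)), n // 2)[1]
          for _ in range(trials)) / trials   # roughly linear in n
```

On any single run the count varies with the coin flips, but the average settles at a few times n; it is exactly this expectation over the coin flips that the next video's analysis bounds by O(n) for every input array.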