ANNOUNCER: The following program is brought to you by Caltech.
YASER ABU-MOSTAFA: Welcome back.
Last time, we talked about regularization, which is a very
important technique in machine learning.
And the main analytic step that we took was to go from a constrained form of
regularization, where you explicitly forbid some of the hypotheses from
being considered, thereby reducing the VC dimension and improving the
generalization property, to an unconstrained version which creates
an augmented error, in which no particular vector of weights is
prohibited per se, but basically you have a preference among weights based on
a penalty that has to do with the constraint.
And that equivalence will make us focus on the augmented-error form of
regularization in almost every practical case.
And the argument for it was to take the constrained version and look at it,
either as a Lagrangian which would be the formal way of solving it, or as we
did it in a geometric way, to find a condition that corresponds to
minimization under a constraint, and find that this would be locally
equivalent to minimizing this in an unconstrained way.
Then we went to the general form of a regularizer, and we called it
Omega of h.
And it depends on small h rather than capital H, which was the other
Omega that we used in the VC analysis.
And in that case, we formed the augmented error as the in-sample
error, plus this term.
And the idea now is that the augmented error will be a better thing
to minimize, if you want to minimize the out-of-sample error, rather than
that-- just minimizing E_in by itself.
And there are two choices here.
One of them is the regularizer
Omega, weight decay or weight elimination, or the other
forms you may find.
And the other one is lambda, which is the regularization parameter-- the
amount of regularization you're going to put.
And the long and short of it is that the choice of Omega in a practical
situation is really a heuristic choice, guided by theory and guided by
certain goals, but there is no mathematical way in a given practical
situation to come up with a totally principled Omega.
But we follow the guidelines, and we do quite well.
So we make a choice of Omega towards smoother or simpler hypotheses.
And then we leave the amount of regularization to the determination of
lambda, and lambda is a little bit more principled.
We'll find out that we will determine lambda using validation, which is the
subject of today's lecture.
And when you do that, you will get some benefit of Omega.
If you choose a great Omega, you will get a great benefit.
If you choose an OK Omega, you will get some benefit.
If you choose a terrible Omega, you are still safe, because lambda will
tell you-- the validation will tell you-- just
take lambda equal to 0, and therefore no harm done.
And as you see, the choice of lambda is indeed critical, because when you
take the correct amount of lambda, which happens to be very small in this
case, the fit, which is the red curve, is very close to the target,
which is the blue.
Whereas if you push your luck, and have more of the regularization,
you end up constraining the fit so much that the red-- it wants to move
toward the blue, but it can't because of the penalty, and ends up being
a poor fit for the blue curve.
So that leads us to today's lecture, which is about validation.
Validation is another technique that you will be using in almost every
machine learning problem you will encounter.
And the outline is very simple.
First, I'm going to talk about the validation set.
There are two aspects that I'm going to talk about.
The size of the validation set is critical.
So we'll spend some time looking at the size of the validation set.
And then we'll ask ourselves, why did we call it validation
in the first place?
It looks exactly like the test set that we looked at before.
So why do we call it validation?
And the distinction will be pretty important.
And then we'll go for model selection, a very important
subject in machine learning.
And it is the main task of validation.
That's what you use validation for.
And we'll find that model selection covers more territory than what the
name may suggest to you.
Finally, we will go to cross-validation, which is a type of
validation that is very interesting, that allows you, if I give you
a budget of N examples, to basically use all of them for validation, and all of
them for training, which looks like cheating, because validation will look
like a distinct activity from training, as we will see.
But with this trick, you will be able to find a way to go around that.
Now, let me contrast validation with regularization, as far as control
of overfitting is concerned.
We have seen, in one form or another, the following by-now-famous equation,
or inequality or rule, where the out-of-sample error that you want is at
most the in-sample error plus some penalty.
Could be penalty for model complexity, overfit complexity, a bunch of other
ways of describing that.
But basically, this tells us that E_in is not exactly E_out.
That, we know all too well.
And there is a discrepancy, and the discrepancy has to do with the
complexity of something.
An overfit penalty has to do with the complexity of the model you are using
to fit, and so on.
So in terms of this equation, I'd like to pose both regularization and
validation as an activity that deals with this equation.
So what about regularization?
We put the equation.
What did regularization do?
It tried to estimate this penalty.
Basically, what we did is concoct a term that we think captures the
overfit penalty.
And then, instead of minimizing the in-sample, we minimize the in-sample
plus that, and we call that the augmented error.
And hopefully, the augmented error will be a better proxy for E_out.
That was the deal.
And we notice that we are very, very inaccurate in the choice here.
We just say, smooth, pick lambda, you can use this, you can use that.
So obviously, we are not satisfying any equality by any chance.
But we are basically getting a quantity that has a monotonic
property, that when you minimize this, this gets minimized, which does the
job for us.
Now, to contrast this, let's look at validation, when it's dealing with the
same equation.
What does validation do?
Well, validation cuts to the chase.
It just estimates the out-of-sample.
Why bother with this analysis, and overfit, and this and that?
You want to minimize the out-of-sample?
Let's estimate the out-of-sample, and minimize it.
Obviously, it's too good to be true, but it's not totally untrue.
Validation does achieve something in that direction.
So let me spend a few slides just describing the estimate.
I'm trying to estimate the out-of-sample error.
This is not completely a foreign idea to us, because we use a test set in
order to do that.
So let's focus on this, and see what are the parameters involved in
estimating the out-of-sample error.
Let's look at the estimate.
The starting point is to take an out-of-sample point x, y.
This is a point that was not involved in training.
We used to call it test point.
Now we are going to call it validation point.
It will not become clear for a while why we are giving it a different name,
until we use the validation set for something, and then the
distinction will become clear.
But as far as you are concerned now, this is just a test point.
We are estimating E_out, and we will just read the value of E_out and be
happy with that, and not do anything further.
So you take this point, and the error on it is the difference between what
your hypothesis does on x, and what the target value is, which is y.
And what is the error?
We have seen many forms of the error.
Let's just mention two to make it concrete.
This could be a simple squared error.
We have seen that in linear regression.
It could be the binary error.
We have seen that in classification.
So nothing foreign here.
Now, if you take this quantity, and we are now treating it as
an estimate for E_out,
a poor estimate, but nonetheless an estimate.
We call it an estimate because, if you take the expected value of that with
respect to the choice of x, with the probability distribution over the
input space that generates x, what will that value be?
Well, that is simply E_out.
So indeed, this quantity, the random variable here, has the correct
expected value.
It's an unbiased estimate of E_out.
But unbiased means that it's as likely to be here or here, in terms of
expected value.
But we could be this, and this would be a good estimate, or we could be
this, and this would be a terrible estimate.
Because you are not getting all of them.
You are just getting one of them.
So if this guy swings very large, and I tell you this is an estimate of
E_out, and you get it here, this is what you will think E_out is.
So there is an error, but the error is not biased.
That's what this equation says.
But we have to evaluate that swing, and the swing is obviously evaluated
by the usual quantity, the variance.
And let's just call the variance sigma squared.
It depends on a number of things, including what is your error measure
and whatnot, but that is what a single point does.
So you get an estimate, but the estimate is poor because it's one
point, and therefore sigma squared is likely to be large.
So you are unlikely to use the estimate on one point as your guide to
E_out.
What do you use?
You move from one point, to a full set.
So you get what?
You get a validation set that you are going to use to estimate E_out.
Now, the notation we are going to have is that the number of points in the
validation set is K. Remember that the number of points in the
training set was N.
So this will be K points, also generated according to the same rules--
independently, according to the probability distribution
over the input space.
And the error on that set we are going to call E_val, as
in validation error.
So we have E_in, and we have E_out.
Now we are introducing another one, E_val, the validation error.
And the form for it is what you expect it to be.
You take the individual errors on the examples, and you take the average,
like you did with the training set, and this one is the validation error.
The only difference is that this is done out-of-sample.
These guys were not used in training, and therefore you would expect that
this would be a good estimate for the out-of-sample performance.
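The quantity just described can be sketched in a few lines. This is an illustrative implementation, not from the lecture; the function and variable names are ours. E_val is simply the average of the point-wise errors of the hypothesis g over the K held-out validation points, using either of the two error measures mentioned.

```python
import numpy as np

def squared_error(y_pred, y_true):
    # Squared error, as in linear regression
    return (y_pred - y_true) ** 2

def binary_error(y_pred, y_true):
    # Binary error, as in classification (agreement of signs)
    return (np.sign(y_pred) != np.sign(y_true)).astype(float)

def validation_error(g, X_val, y_val, pointwise_error):
    # E_val = (1/K) * sum over the K validation points of e(g(x_k), y_k)
    predictions = np.array([g(x) for x in X_val])
    return float(np.mean(pointwise_error(predictions, y_val)))
```

The only requirement, as the lecture stresses, is that the K points passed in here were not used in training.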
Let's see if it is.
What is the expected value of E_val, the validation error?
Well, you take the expected value of this fellow.
The expectation goes inside.
So the main component is the expected value of this fellow, which we have
seen before-- expected value on a single point.
And you just average linearly, as you did.
Now, this quantity happens to be E_out.
The expected value on one point is E_out.
Therefore, when you do that, you just get E_out again.
So indeed, again, the validation error is an unbiased estimate of the out-of-sample
error, provided that all you did with the validation set is just measure the
out-of-sample error.
You didn't use it in any way.
Now, let's look at the variance, because that was our problem with the
single-point estimate, and let's see if there's an improvement.
When you get the variance, you are going to take this formula.
And then you are going to have a double summation, and have all cross
terms of e between different points.
So you will have the covariance between the value for k equals 1 and k
equals 2, k equals 1 and k equals 3, et cetera.
And you also have the diagonal guys, which are the variances in this case,
with k equals 1 and k equals 1 again.
The main components you are going to get are the variances, and a bunch of
covariances.
Actually, there are more covariances than variances, because the variances
are the diagonal, the covariances are the off-diagonal.
There are almost K squared of them.
The good thing about the covariance in this case is that it will be 0,
because we picked the points independently.
And therefore, the covariance between quantities that depend on different
points will be 0.
So I'm only stuck with the diagonal elements, which happen
to have this form.
I have the variance here.
And when I put the summation, something interesting happens.
I have the summation again, a double summation reduced to one,
because I'm only summing the diagonal.
But I still have the normalizing factor with the number of elements.
Because I had K squared elements, the fact that many of them dropped out is
just to my advantage.
I still have the 1 over K squared, and that gives me a better variance for
the estimate based on E_val than for the one based on a single point.
This is your typical analysis of adding a bunch
of independent estimates.
So you get the sigma squared.
That was the variance on a particular point.
But now you divide it by K. Now we see a hope, because even if the original
estimate was this way, maybe we can have K big enough that we keep
shrinking the error bar, such that the E_val itself as a random variable
becomes this, which is around E_out-- what we want.
And therefore, it becomes a reliable estimate.
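A quick simulation checks this sigma squared over K behavior. This is an illustrative sketch with names of our choosing: each "point error" is drawn Uniform(0,1), which has variance 1/12, standing in for sigma squared.

```python
import numpy as np

def empirical_var_of_e_val(K, n_sets=20000, seed=0):
    # Draw n_sets independent validation sets, each with K Uniform(0,1)
    # point errors; E_val is the mean of each row.  A single point has
    # variance sigma^2 = 1/12, so Var(E_val) should come out near (1/12)/K.
    rng = np.random.default_rng(seed)
    e_val = rng.random((n_sets, K)).mean(axis=1)
    return float(e_val.var())
```

Calling this with K = 10 and K = 100 shows the variance shrinking by roughly a factor of 10, matching the 1 over K prediction.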
This looks promising.
Now we can write the E_val, which is a random variable, to be E_out,
which is the value we want, plus or minus something that averages to 0, and
happens to be the order of approximately 1 over square root of K.
If the variance is 1 over K, then the standard deviation is 1 over
square root of K.
I'm assuming here that sigma squared is constant in the
range that I'm using.
And therefore, the dependency on K only comes from here.
Therefore, I have this quantity that tells me this is what I'm
estimating, and this is the error I'm committing, and this is how the error
is behaving as I increase the number K.
The interesting point now is that K is not free.
It's not like I tell you, it looks like if I increase K,
this is a good situation.
So why don't we use more and more validation points?
Because the reality is, K is not given to you on top of your training set.
What I give you is a data set, N points, and it's up to you to use how
many to train, and how many to validate.
I'm not going to give you more, just because you want to validate.
So every time you take a point for validation, you are taking it away
from training, so to speak.
Let's see the ramifications of this regime.
K is taken out of N. So let's now have the notation.
We are given a data set D, as we always called it, and it has
N points.
What do we do with it?
We are going to take K points, and use them for validation.
And you can take any K points, as long as you don't look at the particular
input and output.
Let's say you pick K points at random, from the N points.
That will be a valid validation set for you.
So I have the K points, and therefore I'm left with N minus K for training.
The ones I left for training, I'm going
to call D_train.
I didn't have to use that when I didn't have validation, because
D all went to training, so I didn't need to have the
distinction.
Now, because I have two utilities, I'm going to take the guys that go into
training and call that subset D_train.
And the guys that I hold for validation I'm going to
call it D_val.
The union of them is D. That's the setup.
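This setup is easy to sketch in code. The function below is illustrative (the name and signature are ours): it picks K of the N points at random for D_val, and the remaining N minus K become D_train.

```python
import numpy as np

def split_train_val(X, y, K, seed=0):
    # Pick K of the N points at random for validation, without looking
    # at the particular inputs and outputs; N - K remain for training.
    N = len(X)
    perm = np.random.default_rng(seed).permutation(N)
    val_idx, train_idx = perm[:K], perm[K:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx])
```

The union of the two pieces is D, exactly as in the setup above.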
Now, we looked in the previous slide at the reliability of the estimate of
the validation set.
And we found that this reliability, if we measure it by the error bar on the
fluctuation, it will be the order of 1 over square root of K, the
number of validation points.
Then our conclusion is that if you use small K, you have a bad estimate.
And the whole role we have for validation so far is estimate, so we
are not doing a good job.
So we need to increase K.
It looks like a good idea, just from that point of view, to take
large K.
But there are ramifications for taking large K, so we have a question mark.
And let's try to be quantitative about it.
Remember this fellow?
That was the learning curve.
What did it do?
It told us, as you increase the number of training points, what is the
expected value of E_out and what is the expected value of E_in, for a given
model, the model that I'm plotting the learning curves of. Right?
Now, the number of data points used to be N. Here I'm writing it as N minus
K. Why am I doing that?
Because under the regime of validation, this is what
I'm using for training.
Therefore if you increase K, you are moving in this direction, right?
I used to be here, and I used to expect that level of E_out.
Now I am here, and I'm expecting that level of E_out.
That doesn't look very promising.
I may get a reliable estimate, because I'm using bigger K, but I'm getting
a reliable estimate of a worse quantity.
If you want to take an extreme case, you are going to take this estimate and
go to your customer, and tell them what you expect for the performance to be.
So you don't only deliver the final hypothesis.
You deliver the final hypothesis, with an estimate for how it will do when
they test it on a point that you haven't seen before.
Now, we want the estimate to be very reliable, and you forget about the
quality of the hypothesis.
So you keep increasing K, keep increasing K, keep increasing K.
You end up with a very, very reliable estimate.
The problem is that it's an estimate of a very, very poor quantity, because
you used 2 examples to train, and you are basically in the noise.
So the statement you are going to make to your customer in this case is that,
here is a system.
I am very sure that it's terrible!
That is unlikely to please a customer.
So now, we realize that there is a price to be paid for K. It turns out
that we are going to have a trick that will make us not pay that price.
But still, the question of what happens when K is big is a question
mark in our mind.
What I'm going to do now is tell you: you used K points to
estimate the error.
Now, why don't you restore the data set after
you have the estimate? The estimate is now in your pocket, so train
on the full set, and you get the better guy.
Well, I estimated for the smaller guy.
What are we doing here?
Let's just do this systematically.
Let's put K back into the pot.
So here is the regime.
I'm going to describe this figure, but let's talk it piece by piece.
We have the data set D, right?
We separated it into D_train and D_val.
D itself has N points.
We took N minus K to train, K to validate.
That's the game.
What happened?
If we used the full training set to train, we would get a final hypothesis
that we called g.
This is just a matter of notation.
But under the regime of validation, you took out some guys.
And therefore, you are using only D_train to train.
And this has N minus K. Doesn't have all the examples.
Therefore, I am going to generically label the final hypothesis that I get
from training on a reduced set, D_train, I am going to call it g minus.
Just to remind ourselves that it's not on the full training set.
So now, here is the idea, if you look at the figure.
I have the D. Let me get it a bit smaller so that we can get the output.
If I use the training set by itself, I would get g.
What I am doing now is that I am going to take D_train, which has fewer
examples, and the rest go to validation.
I use D_train to get g minus, and then I take g minus and evaluate it on
D_val, the validation set, in order to get an estimate.
So the trick now is that instead of reporting g minus as my final
hypothesis, I know if I added the other data points here to the pot, I
am going to get a better out-of-sample.
I don't know what it is.
I don't have an estimate for it.
But I know it's going to be better than the one for g minus, simply
because of the learning curve.
On average, I get more examples, I get better out-of-sample error.
So I put it back and then report g.
So it's a funny situation.
I'm giving you g, and I'm giving you the validation estimate on g minus.
Why?
Because that's the only estimate I have.
I cannot give you the estimate on g, because now if I get g, I don't have
any guys to validate on.
So you can see now the compromise.
Under this scenario, I'm not really losing in performance by taking
a bigger validation set, because I'm going to put them back when I get
the final hypothesis.
What I am losing here is that, the validation error I'm reporting, is
a validation error on a different hypothesis than the
one I am giving you.
And if the difference is big, then my estimate is bad, because I'm
estimating on something other than what I am giving you.
And that's what happens when you have large K.
When you have large K, the discrepancy between g minus and g is bigger.
And I am giving you the estimate on g minus.
So that estimate is poor.
And therefore, I get a bad estimate again.
Now, you see the subtlety here.
This is the regime that is used in validation, universally.
After you do your thing, and you do your estimates, and, as you will see
further, you do your choices, you go and put all the examples to train on,
because this is your best bet of getting a good hypothesis.
If your K is small, the validation error is not reliable.
It's a bad estimate, just because the variance of it is big.
I have small K, it's 1 over square root of K, so I'm doing this.
If you get big K, the problem is not the reliability of the estimate.
The problem is that the thing you are estimating is getting further and
further away from the thing you are reporting.
So now we have a compromise.
We don't want K to be too small, in order not to have fluctuations.
We don't want K to be too big, in order not to be too far from
what we are reporting.
And as usual in machine learning, there is a rule of thumb.
And the rule of thumb is pretty simple.
That's why it's a rule of thumb.
It says, take one fifth for validation.
That usually gives you the best of both worlds.
Nothing proved.
You can find counterexamples.
I'm not going to argue with that.
It's a rule of thumb.
Use it in practice, and actually you will be quite successful here.
There's an argument with some people, whether it should be N
over 5 or N over 6.
I'm not going to fret over that.
It's a rule of thumb, after all, for crying out loud!
We'll just leave it at that.
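The whole regime described above, with the N over 5 rule of thumb, can be summarized in one short sketch. This is illustrative code, not the lecture's; `learner` stands for any training procedure that fits a hypothesis to data, and `point_error` is a point-wise error measure.

```python
import numpy as np

def train_with_validation(learner, point_error, X, y, seed=0):
    N = len(X)
    K = N // 5                                  # rule of thumb: one fifth for validation
    perm = np.random.default_rng(seed).permutation(N)
    val, tr = perm[:K], perm[K:]
    g_minus = learner(X[tr], y[tr])             # g minus: trained on N - K points
    # Unbiased estimate, but of g minus, not of the g we report
    e_val = float(np.mean(point_error(g_minus(X[val]), y[val])))
    g = learner(X, y)                           # restore the K points; report g
    return g, e_val
```

The funny situation from the previous slides is visible right in the return statement: the hypothesis delivered is g, while the estimate delivered was measured on g minus.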
So we now have that.
Let's go to the other aspect.
We know what validation is, and we understand how critical it is to
choose the number, and we have a rule of thumb.
Now let's ask the question, why are we calling this validation
in the first place?
So far, it's purely a test.
We get an out-of-sample point.
The estimate is unbiased.
What is the deal?
We call it validation, because we use it to make choices.
And this is a very important point, so let's talk about it in detail.
Once I make my estimate affect the learning process, the set I am using
is going to change nature.
So let's look at a situation that we have seen before.
Remember this fellow?
Yeah, this was early stopping in neural networks.
And let me magnify it for you to see the green curve.
Do you see the green curve now?
OK, so there is a green curve.
Now we'll scoot it back.
So the in-sample goes down.
Out-of-sample--
let's say that I have a general estimate for the out-of-sample--
goes down with it until such a point that it goes up, and we have the
overfitting, and we talked about it.
And in this case, it's a good idea to have early stopping.
Now, let's say that you are using K points, that you did not use for
training, in order to estimate E_out.
That would be E_test, the test error, if all you are doing is just plotting
the red in order to look at it and admire it.
Oh, that's a nice curve.
Oh, it's going up.
But you're not going to take any actions based on it.
Now, if you decide that, this is going up,
I had better stop here.
That changes the game dramatically.
All of a sudden, this is no longer a test error.
Now it's a validation error.
So you ask yourself, what the heck?
It's just semantics.
It's the same curve.
Why am I calling it a different name?
I'm calling it a different name, because it used to be unbiased.
That is, if this is an estimate of E_out, not the actual E_out, there will be
an error bar in estimating E_out.
But it is as likely to be optimistic, as pessimistic.
Now, when you do early stopping, you say, I'm going to stop here and I'm
going to use this value as my estimate for what you are getting.
I claim that your estimate is now biased.
It's the same point.
You told us it was unbiased before.
What is the deal?
Let's look at a very specific simple case, in order to
understand what happens.
This is no longer a test set.
It becomes, in red, a validation set.
Fine, fine.
Now convince us of the substance of it.
We know the name.
So let's look at the difference, when you actually make a choice.
Very simple thing that you can reason.
Let's say I have the test set, which is unbiased, and I'm claiming that the
validation set has an optimistic bias.
Optimism is good.
But here, it's optimism followed by disappointment.
It's deception.
We are just calling it optimistic, to understand that it's always in the
direction of thinking that the error will be smaller than it will, actually,
turn out to be.
So let's say we have two hypotheses.
And for simplicity, let's have them both have the same E_out.
So I have two hypotheses.
Each of them has out-of-sample error 0.5.
Now, I'm using a point to estimate that error.
And I have two estimates,
e_1 for the hypothesis 1, and e_2 for the hypothesis 2.
I'm going to use-- because the estimate has fluctuations in it, just
again for simplicity, I'm going to assume that both e_1 and e_2 are uniform
between 0 and 1.
So indeed, the expected value is half, which is the expected value I want,
which is the out-of-sample error.
Now, I'm not going to assume strictly that e_1 and e_2 are independent, but
you can assume they are independent for the sake of argument.
But they can have some level of correlation, and you'll still get the
same effect.
Let's think now that they are independent variables, e_1 and e_2.
Now, e_1 is an unbiased estimate of its out-of-sample error, right?
Right.
e_2 is the same, right?
Right.
Unbiased means the expected value is what it should be.
And the expected value, indeed in this case,
is what it should be, 0.5.
Now, let's take the game, where we pick one of the hypotheses,
either h_1 or h_2.
How are we going to pick it?
We are going to pick it according to the value of the error.
So now, the measurement we have is applying to that choice.
What I'm going to do, I'm going to pick the
smaller of e_1 and e_2.
And whichever that one is, I'm going to pick the hypothesis that
corresponds to it.
So this is mini learning.
You look at the errors, pick the smaller one, and that is the one you go with.
My question to you is very simple.
What is the expected value of e?
A naive thought would say, you told us the expected value of e_1 is 1/2.
You told us the expected value of e_2 is 1/2.
e has to be either e_1 or e_2.
So the expected value should be 1/2?
Of course not, because now the rules of the game-- the probabilities that
you're applying-- have changed, because you are deliberately picking the
minimum of the realization.
And it's very easy to see that the expected value of e is less than 0.5.
The easiest thing to say is that if I have two variables like that, the
probability that the minimum will be less than 1/2 is 75%, because all you
need is for one of them to be less than 1/2.
If the probability of being less than 1/2 is 75%, you expect the
expected value to be less than 1/2.
It's mostly there.
The mass is mostly below.
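The claim above is easy to verify numerically. In this illustrative simulation, e_1 and e_2 are each unbiased with mean 1/2, yet the minimum has P(e < 1/2) equal to 3/4 and, in fact, expected value 1/3.

```python
import numpy as np

# e_1, e_2 ~ Uniform(0,1): each unbiased (mean 1/2), but the chosen
# e = min(e_1, e_2) is optimistically biased.
rng = np.random.default_rng(0)
e1, e2 = rng.random(1_000_000), rng.random(1_000_000)
e = np.minimum(e1, e2)

mean_e1 = e1.mean()             # close to 0.5: unbiased before choosing
frac_below_half = (e < 0.5).mean()  # close to 0.75, as argued above
mean_e = e.mean()               # close to 1/3, well below 0.5
```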
So now you realize this is what?
This is an optimistic bias.
And that is exactly the same as what happened with the early stopping.
We picked the point because it's minimum on the realization, and that is what
we reported.
Because of that-- the thing used to be this,
but we wait.
When it's there, we ignore it.
When it's here, we take it.
So now that introduces a bias, and that bias is optimistic.
And that will be true for the validation set.
Our discussion so far is based on just looking at the E_out.
Now we're going to use it, and we're going to introduce a bias.
Fortunately for us, the use of validation in machine learning is so
light, that we are going to swallow the bias.
Bias is minor.
We are not going to push our luck.
We are not going to estimate tons of stuff, and keep adding bias until the
validation error basically becomes training error in disguise.
We're just going to-- let's choose a parameter, choose between models,
and whatnot.
And by and large, if you do that, and you have a respectable-size validation
set, you get a pretty reliable estimate for the E_out,
conceding that it's biased, but the bias is not going to hurt us too much.
So this is the general philosophy.
Now with this understanding, let's use validation set for model
selection, which is what validation sets do.
That is the main use of validation sets.
And the choice of lambda, in the case we saw, happens to be
a manifestation of this.
So let's talk about it.
Basically, we are going to use the validation set more than once.
That's how we are going to make the choice.
So let's look.
This is a diagram.
I'm going to build it up.
Let's build it up, and then I'll focus on it, and look at how the
diagram reflects the logic.
We have M models that we're going to choose from.
When I say model, you are thinking of one model versus another, but this is
really talking more generally.
I could be talking about models as in, should I use linear models or neural
networks or support vector machines?
These are models.
I could be using only polynomial models.
And I'm asking myself, should I go for 2nd order, 5th
order, or 10th order?
That's a choice between models.
I could be using 5th-order polynomials throughout, and the only
thing I'm choosing,
should I choose lambda of the regularization to be 0.01, 0.1, or 1?
All of this lies under model selection.
There's a choice to be made, and I want to make it in a principled way,
based on the out-of-sample error, because that's the bottom line.
And I'm going to use the validation set to do that.
This is the game.
So we'll call them, since they're models, I have H_1 up to H_M.
And we are going to use D to train, and I am going to get
as a result of that--
it's not the whole set, as usual, so I left some for validation.
And I'm going to get g minus.
That is our convention for whenever we train on something less
than the full set.
But because I'm getting a hypothesis from each model, I am labeling it
by the subscript m. So there is g_1 up to g_M, with a minus, because they
used D_train to train.
So I get one for each model.
And then I'm going to evaluate that fellow, using the validation set.
The validation set are the examples that were left out from D, when I took
the D_train.
So now I'm going to do this.
All I'm doing is exactly what I did before, except I'm doing it M
times, and introducing the notation that goes with that.
Let's look at the figure now a little bit.
Here is the situation.
I have the data set.
What do I do with it?
I break it into two,
validation and training.
I use the training to apply to each of these
hypothesis sets, H_1 up to H_M.
And when I train, I end up with a final hypothesis.
It is with a minus, a small minus in this case, because I'm
training on D_train.
And they correspond to the hypotheses they came from, so g_1, g_2, up to g_M.
These are done without any validation, just training
on a reduced set.
Once I get them, I'm going to evaluate their performance.
I'm going to evaluate their performance using the validation set.
So I take the validation set and run it here.
It's out-of-sample as far as they're concerned, because it's
not part of D_train.
And therefore, I'm going to get estimates-- these are the validation errors.
I'm just giving them a simple notation as E_1, E_2, up to E_M.
Now, your model selection is to look at these errors, which supposedly
reflect the out-of-sample performance if you use this as your final product,
and you pick the best.
Now that you are picking one of them, you immediately have alarm bells--
bias, bias, bias.
Something is happening now, because now we are going to be biased.
Each of these guys was an unbiased estimate of the out-of-sample error of
the corresponding hypothesis.
You pick the smallest of them, and now you have a bias.
So the smallest of them will give the index m star, whichever that might be.
So E_m star is the validation error on the model we selected, and now we
realize it has an optimistic bias.
And we are not going to take g_m star minus, which is the one that
gave rise to this.
We are now going to go back to the full data set, as
we said in our regime.
We are going to train with it.
And from that training, which is training now on the model we chose, we
are going to get the final hypothesis, which is g_m star.
So again, we are reporting the validation error of a reduced
hypothesis, if you will, but delivering the full hypothesis-- the best we can do,
because we know that we get better out-of-sample
when we add the examples.
So this is the regime.
Let's complete the slide.
E_m, that we introduced, happens to be the value of the validation error on
the reduced hypothesis g_m minus, as we discussed.
And this is true for all of them.
And then you pick the model m star, that happens to have the smallest E_m.
And that is the one that you are going to report, and you are going to
restore your D, as we did before, and this is what you have.
This is the algorithm for model selection.
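The algorithm just described can be sketched in code. This is a minimal sketch, assuming squared error and polynomial models as the M hypothesis sets; the target function, split sizes, and degrees are made-up illustrations, since the lecture does not fix a specific learning algorithm.

```python
import numpy as np

def fit_poly(X, y, degree):
    # Least-squares polynomial fit: returns a coefficient vector.
    return np.polyfit(X, y, degree)

def sq_err(w, X, y):
    # Mean squared error of the fitted polynomial on (X, y).
    return np.mean((np.polyval(w, X) - y) ** 2)

rng = np.random.default_rng(0)
N, K = 30, 10                      # total points, validation-set size
X = rng.uniform(-1, 1, N)
y = np.sin(np.pi * X) + 0.2 * rng.standard_normal(N)

# Split D into D_train (N - K points) and D_val (K points).
X_tr, y_tr = X[:-K], y[:-K]
X_val, y_val = X[-K:], y[-K:]

degrees = [1, 2, 3, 5, 8]          # the M models H_1 .. H_M

# Train each model on D_train to get g_m minus, then score it on D_val.
val_errors = [sq_err(fit_poly(X_tr, y_tr, d), X_val, y_val) for d in degrees]

# Pick the model m* with the smallest validation error E_m ...
m_star = int(np.argmin(val_errors))

# ... then restore D: retrain the chosen model on ALL N points to get g_{m*}.
g_final = fit_poly(X, y, degrees[m_star])
```

Note that `g_final`, not the finalist `g_{m*}` minus, is what gets reported, exactly as in the regime above.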
Now, let's look at the bias.
I'm going to run an experiment to show you the bias.
So let me put it here and just build towards it.
What is the bias now?
We know we selected a particular model, and we selected
it based on D_val.
That's the killer.
When you use the estimate to choose, the estimate is no longer
reliable, because you particularly chose it, so now it looks
optimistic.
Because by choice, it has a good performance, not because it has
an inherently good performance. Because you looked for the one with
the good performance.
So the expected value of this fellow is now a biased estimate of the
ultimate quantity we want, which is the out-of-sample error.
So E_val, the sample thing, is biased from that.
And we would like to evaluate that.
Here is the illustration on the curve, and I'm going to ask you
a question about it, so you have to pay attention in order to be able to
answer the question.
Here is the experiment.
I have a very simple situation.
I have only two models to choose between.
One of them is 2nd-order polynomials, and the other one is
5th-order polynomials.
I'm generating a bunch of problems, and in each of them, I make a choice
based on validation set.
And after that, I look at the actual out-of-sample error.
And I'm trying to find out whether there is a systematic bias in the one
I choose, with respect to its out-of-sample error.
So it's not fixed which one I chose--
I'm not saying that I always chose H_2 or H_5.
In each run, I may choose H_2 sometimes and H_5 sometimes, whichever gave me
the smaller E_val.
And I'm taking the average over an incredible number of runs.
That's why you have a smooth curve.
So this will give me an indication of the typical bias you get when you make
a choice between two models, under the circumstances of the experiment.
Now the experiment is done carefully, with few examples.
The total is 30-some examples.
And I'm taking a validation set, which is 5 examples, 15
examples, up to 25 examples.
So at this point really, the number of examples left for
training is very small.
And I'm plotting this, so this is what I get for the average over the runs, of
the validation error on the model I chose-- the final hypothesis of the
model I chose.
And this is the out-of-sample error of that guy.
Now, I'd like to ask you two questions.
Think about them, and also for the online audience,
please think about them.
First question--
why are the curves going up?
This is K, the size of the validation set.
I'm evaluating it.
It's not because I'm evaluating on more points that the
curves are going up.
It's because when I use more for validation, I'm inherently using less
for training.
So there's an N minus K that is going the other direction.
And what we are seeing here really is the learning curve, backwards.
This is E_out.
I have more and more examples to train as I go here, so the out-of-sample
error goes down.
So in the other direction, it goes up.
And this, being an estimate for it, goes up with it.
So that makes sense.
Second question--
why are the two curves getting closer together?
Whether they're going up or down, that's not my concern at this point,
just the fact that they are converging to each other.
Now, that has to do with K proper, directly.
The other had to do with K indirectly, because I'm left with N minus K.
But now, when I have bigger K, the estimate is more and more reliable,
and therefore I get closer to what I'm estimating.
So we understand this.
This is definitely the evidence,
and in every situation you will have, there will be a bias.
How much bias depends on a number of factors, but the bias is there.
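This optimistic bias is easy to demonstrate numerically. Here is a small simulation under assumed conditions (a noisy quadratic target, squared error, and a large fresh sample standing in for E_out; none of these details are the lecture's actual experimental setup): choose between H_2 and H_5 by validation error on K = 5 points, over many runs, and compare the average selected validation error to the average true out-of-sample error of the selected model.

```python
import numpy as np

rng = np.random.default_rng(1)
runs, N, K, sigma = 500, 30, 5, 0.3
val_sel, out_sel = [], []

for _ in range(runs):
    X = rng.uniform(-1, 1, N)
    y = X**2 + sigma * rng.standard_normal(N)          # noisy quadratic target
    X_v, y_v = X[:K], y[:K]                            # validation set (K points)
    X_tr, y_tr = X[K:], y[K:]                          # training set (N - K points)
    # Large fresh sample used as a stand-in for the true E_out.
    X_te = rng.uniform(-1, 1, 1000)
    y_te = X_te**2 + sigma * rng.standard_normal(1000)

    best_e, best_w = None, None
    for d in (2, 5):                                   # choose between H_2 and H_5
        w = np.polyfit(X_tr, y_tr, d)
        e = np.mean((np.polyval(w, X_v) - y_v) ** 2)   # E_val of g_d minus
        if best_e is None or e < best_e:
            best_e, best_w = e, w
    val_sel.append(best_e)
    out_sel.append(np.mean((np.polyval(best_w, X_te) - y_te) ** 2))

# The selected validation error is optimistic: on average it
# underestimates the out-of-sample error of the chosen model.
bias = float(np.mean(out_sel) - np.mean(val_sel))
```

With a small K, the gap is clearly positive: picking the minimum of noisy estimates makes the winning estimate look better than it really is.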
Let's try to find, analytically, a guideline for the type of bias.
Why is that?
Because I'm using the validation set to estimate the out-of-sample error,
and I'm really claiming that it's close to the out-of-sample error.
And we realize that, if I don't use it too much, I'll be OK.
But what is too much?
I want to be a little bit quantitative about it, at least as a guideline.
So I have M models, and you can see that the M is in red.
That should remind you when we had M in red very early in the
course, because M used to make things worse.
It was the number of hypotheses, when we were talking about generalization.
And it was really that, when you have bigger M, you are in bigger trouble.
So it seems like we are also going to be in bigger trouble here, but the
manifestation is different.
We have now M models we are choosing from-- models
in the general sense.
This could be M values of the regularization parameter lambda in
a fixed situation, but we are still making one of M choices.
Now, the way to look at it is to think that the validation set is actually
used for training, but training on a very special hypothesis set, the
hypothesis set of the finalists.
What does that mean?
So I have H_1 up to H_M.
I'm going to run a full training algorithm on each of them, in order to
find a final hypothesis from each, using D_train.
Now, after I'm done, I am only left with the finalists,
g_1 up to g_M, with a minus sign because they are trained on the
reduced set.
So the hypothesis set that I am training on now is just those guys.
As far as the validation set is concerned, it didn't know what
happened before.
It doesn't relate to D_train.
All you did, you gave it this hypothesis set, which is the final
hypotheses from your previous guy, and you are asking it to choose.
And what are you going to choose?
You are going to choose the minimum error.
Well, that is simply training.
If I just told you that this is your hypothesis set, and that D_val is your
training, what would you do?
You will look for the hypothesis with the smallest error.
That's what you are doing here.
So we can think of it now as if we are actually training on this set.
And this tells us, oh, we need to estimate the discrepancy or the bias
between this and that.
Now it's between the validation error and the out-of-sample error.
But the validation error is really the training error on this special set.
So we can go back to our good old Hoeffding and VC, and say that the
out-of-sample error,
in this case, given from those-- and now you can see that the
choice here is star.
So I'm actually choosing one of those guys.
This is my training, and the final, final hypothesis is this guy-- is less
than or equal to the out-of-sample error, plus a penalty for the model
complexity.
And the penalty, if you use even the simple union bound,
will have that form.
You still have the 1 over square root of K, so you can always make it
better by having more examples.
But then you have a contribution because of the number of guys you are
choosing from.
If you are choosing between 10 guys, that's one thing.
If you are choosing between 100 guys, that's another.
It's worse.
Well, benignly worse, because it's logarithmic, but
nonetheless, worse.
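The penalty itself is easy to write down. Here is a sketch of the simple union-bound version, with Hoeffding applied to the M finalists; the confidence parameter delta and the exact constants are my assumptions, since the lecture only gives the 1-over-square-root-of-K and log-M behavior.

```python
import math

def validation_penalty(M, K, delta=0.05):
    # Hoeffding + union bound over M finalist hypotheses: with
    # probability >= 1 - delta, for errors in [0, 1],
    #   E_out(g_m*) <= E_val(g_m*) + sqrt(ln(2M / delta) / (2K)).
    return math.sqrt(math.log(2 * M / delta) / (2 * K))

p10 = validation_penalty(M=10, K=100)      # choosing among 10 finalists
p100 = validation_penalty(M=100, K=100)    # among 100: worse, but only logarithmically
p10_big = validation_penalty(M=10, K=400)  # 4x the validation points halves the penalty
```

The penalty grows only logarithmically in M (benignly worse), and shrinks like 1 over the square root of K.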
And if you are choosing between an infinite number of guys, we know
better than to dismiss the case off-hand.
You say, infinite number of guys, we can't do that.
No, no, no. Because once you go to the infinite choices,
you don't count anymore.
You go for a VC dimension of what you are doing.
That's what the effective complexity goes with.
And indeed, if you're looking for choice of one parameter, let's say I'm picking the
regularization parameter.
When you are actually picking the regularization parameter, and you
haven't put a grid-- you don't say, I'm choosing between 1, 0.1,
and 0.01, et cetera--
a finite number.
I'm actually choosing the numerical value of lambda, whatever it would be.
So I could end up with lambda equal 0.127543.
You are making a choice between an infinite number of guys, but you don't
look at it as an infinite number of guys.
You look at it as a single parameter.
And we know a single parameter goes with a VC dimension 1.
That doesn't faze us.
We dealt with VC dimensions much bigger than that.
And we know that if we have one parameter, or maybe two parameters,
and the VC dimension maybe is 2, if you have a decent set--
in this case decent K, not decent N, because that's the size of the set you
are talking about--
then your estimate will not be that far from E_out.
This is the idea.
So now you can apply this with the VC
analysis. Instead of just going for the number, which is the union bound,
you go for the VC version, and now apply it to this fellow.
And you can ask yourself, if I have a regularization
parameter, what do I need?
Or if I have another thing, which is the early stopping.
What is early stopping?
I'm choosing how many epochs to run.
Epochs is integer, but there is a continuity to it, so I'm
choosing where to stop.
All of those choices, where one parameter is being chosen one way or
the other, correspond to one degree of freedom.
So if I tell you the rule of thumb is that, when you are using the
validation set, if it's a reasonable-size set, let's say 100 points, and you
use those 100 points to choose a couple of parameters, you are OK.
You already can relate to that.
You don't need me to tell you that.
Because, 100 points, VC dimension 2,
yeah, I can get something.
Now, if I give you the 100 points and tell you you are choosing 20
parameters, you immediately say, this is crazy.
Your estimate will be completely ruined, because you are now
contaminating the thing.
This is now genuinely training, because the choice of the value of
a parameter is what?
Well, that's what training did.
The training of a neural network tried to choose the weights of the network,
the parameters.
There were just so many of them, that we called it training.
Now, when it's only one parameter or two, we call it choice of a parameter by
validation.
So it's a gray area.
If you push your luck in that direction, the validation estimate
will lose its main attraction, which is the fact that it's a reasonable
estimate of the out-of-sample, that we can rely on.
The reliability goes down.
So there is this tradeoff.
So with the data contamination, let me summarize it as follows.
We have error estimates.
We have seen some of them.
We looked at the in-sample error, the out-of-sample error, or E_test, and
then we have E_val, the validation error.
I'd like to describe those as data contamination, that if you use the
data to make choices, you are contaminating it as far as its ability
to estimate the real performance.
That's the idea.
So you can look at what the contamination is.
It's the built-in optimistic bias-- better described as deceptive,
because it's bad--
you are going to go to the bank and tell them, I can
forecast the stock market.
No, you can't.
So that's bad.
You were optimistic before you went there.
After that, you are in trouble.
You are trying to get the bias in estimating E_out, and you are trying to
measure what is the level of contamination.
So let's look at the three sets we used.
We have the training set.
This is just totally contaminated.
Forget it.
We took a neural network with 70 parameters, and we did backpropagation, and
we went back and forth, and we ended up with something, and we have a great
E_in, and we know that E_in is no indication of E_out.
This has been contaminated to death.
So you cannot really rely on E_in, as an estimate for E_out.
When you go to the test set, this is totally clean.
It wasn't used in any decisions.
It will give you an estimate.
The estimate is unbiased.
When you give that as your estimate, your customer is as likely to be
pleasantly surprised, as unpleasantly surprised.
And if your test set is big, they are likely not to be surprised at all.
It'll be very close to your estimate.
So there is no bias there.
Now, the validation set is in between.
It's slightly contaminated, because it made a few choices.
And the wisdom here,
please keep it slightly contaminated.
Don't get carried away.
Sometimes when you are in the middle of a big problem, with lots of
data, you choose this parameter.
Then, oh, there's another parameter I want to choose, so you use the same
validation set--
alarm bells, alarm bells-- and you keep doing it.
So you should have a regime to begin with, that you should have not only one
validation set.
You could have a number of them, such that when one of them gets dirty,
contaminated, you move on to the other one which hasn't been used for
decisions, and therefore the estimates will be reliable.
Now we go to cross-validation.
Very sweet regime, and it has to do with the dilemma about K. So now we're
not talking about biased versus unbiased, because this is
already behind us.
Now we're looking at an estimate and a variation of the
estimate, as we did before.
And we have the discipline to make sure that we don't mess it up by
making it biased.
So that is taken for granted.
Now I'm just looking at a regime of validation as we described it, versus
another regime which will get us a better estimate, in terms of the error
bar, the fluctuation around the estimate we want.
So we had the following chain of reasoning.
E_out of g, the hypothesis we are actually going to report, is what we
would like to know.
If we know that, we are set.
We don't have that, but that is approximately the same
as E_out of g minus.
This is the out-of-sample error, the proper out-of-sample error, but on the
hypothesis that was trained on a reduced set.
Correct?
And if I didn't take too many examples, they are
close to each other.
This one happens to be close to the validation estimate of it.
So here, it is because it's a different set that I'm training on.
Here, it's because I am making a finite-sample
estimate of the quantity.
Here, I could go up and down from this.
I'm looking at this chain. This is really what I want, and this
is what I'm working with.
This is unknown to me.
In order to get from here to here, I need the following.
I need K to be small, so that g minus is fairly close to g.
And therefore I can claim that their out-of-sample error is close, because
the bigger K is, the bigger the discrepancy between the training set
and the full set, and therefore the bigger the discrepancy between the
hypothesis I get here, and the hypothesis I get here.
So I'd like K to be small.
But also, I'd like K to be large, because the bigger K is, the more
reliable this estimate is, for that.
So I want K to have two conditions.
It has to be small, and it has to be large.
We will achieve both.
You'll see in a moment.
New mathematics is going to be introduced!
So here is the dilemma.
Can we have K to be both small and large?
The method looks like complete cheating, when you look at it first,
and then you realize, this is actually valid.
So what do we do?
I'm going to describe one form of cross-validation, which is the simplest
to describe, which is called "leave one out". Other methods will be
"leave more out", that's all.
But let's focus on "leave one out".
Here is the idea.
You give me a data set of N. I am going to use N minus
1 of them for training.
That's good, because now I am very close to N, so the hypothesis g minus
will be awfully close to g.
That's great, wonderful, except for one problem.
You have one point to validate on.
Your estimate will be completely laughable, right?
Not so fast.
In terms of a notation, I'm going to create a reduced data set from
D, call it D_n, because I'm actually going to repeat this
exercise for different indices n.
What do I do?
I take the full data set, and then take one of the points, that happens to
be n, and take it out.
This will be the one that is used for validation.
And the rest of the guys are going to be used for training.
Nothing different, except that it's a very small validation set.
That's what is different.
Now the final hypothesis, that we learn from this particular set, we have
to call g minus because it's not on the full set.
But now, because it depends on which guy we left out, we give it the label
of the guy we left out.
So we know that this one is trained on all the examples but n.
Let's look at the validation error, which has to be one point.
This would be what?
This would be E validation-- big symbol of this and that--
but in reality, the validation set is one point, so this is simply just the
error on the point I left out.
g_n minus did not involve the n-th example.
It was taken out.
And now that we froze it, we are going to evaluate it on that example,
so that example is indeed out-of-sample for it.
So I get this fellow.
Now, I know that this guy is an unbiased estimate, and I know that
it's a crummy estimate.
That much, I know.
Now, here is the idea.
What happens if I repeat this exercise for different n?
So I generate D_1, do all of this, and end up with this estimate.
Do D_2, all of this-- end up with another estimate.
Each estimate is out-of-sample with respect to the hypothesis that it's
used to evaluate.
Now, the hypotheses are different.
So I'm not really getting the performance of a particular hypothesis.
For this hypothesis, this is the estimate.
It's off.
For this hypothesis, this is the estimate.
It's off.
For this hypothesis, this is the estimate.
The common thread between all the hypotheses is that they are hypotheses
that were obtained by training on N minus 1 data points.
That is common between all of them.
It's different N minus 1 data points, but nonetheless,
it's N minus 1.
Because of the learning curve, I know there is a tendency.
If I told you this is the number of examples, you can tell me what is the
expected out-of-sample error.
So in spite of the fact that these are different hypotheses, the fact that
they come from the same number of points,
N minus 1,
tells me that they are all realizations of something that is the
expected value of all of them.
So the small errors estimate the error on these guys,
and these guys estimate the error of the expected value on N minus 1
examples, regardless of the identity of the examples.
So there is something common between these guys.
They are trying to estimate something.
So now what I'm going to do, I am going to define the cross-validation
error to be--
E cross-validation, E_cv,
to be the average of those guys.
It's a funny situation now.
These came from N full training sessions, each of them
followed by a single evaluation on a point, and I get a number.
And after I'm done with all of this, I take these numbers and average them.
Now, if you think of it as a validation set, now all of a sudden
the validation set is very respectable.
It has N points.
Never mind the fact that each of them is evaluated on a different
hypothesis.
I was able to use N minus 1 points to train, and that will give me
something very close to what happens with N. And I'm
using N points to validate.
The catch, obviously--
these are not independent, because the examples were used to create the
hypotheses, and some example was used to evaluate them.
And you will see that each of them is affected by the other, because the
hypothesis either has the point you left out, or you are evaluating on that.
Let's say, e_1 and e_3.
e_1 was used to evaluate the error on a hypothesis that involved the third
example, because the third example was in, when I talk about e_1.
Then e_3 was used to evaluate on the third example, but on a hypothesis
that involved the first example.
So you can see where the correlation is.
Surprisingly, the effective number, if you use this, is very close to N. It's
as if they were independent.
If you do the variance analysis, then out of 100 examples,
it's probably as if you were using 95 independent examples.
So it's remarkably efficient, in terms of getting that.
So this is the algorithm.
Now, let's illustrate it.
If you understand this, you understand cross-validation.
I'm illustrating it for the "leave one out".
I have a case.
I am trying to estimate a function.
I actually generated this function using a particular target.
I'm not going to tell you yet what it is. Added some noise.
And I am trying to use cross-validation, in order to choose a model,
or to just evaluate the out-of-sample error.
So let's evaluate the out-of-sample error using the cross-validation
method, for a linear model.
So what do you do?
First order of business, take a point that you will leave out. Right?
So now, this guy is the training set, and this guy is the validation set.
It's one point.
Then you train.
And you get a good fit.
Then, you evaluate the validation error on the point you left out.
That will be that.
That's one session.
We are going to repeat this three times, because we have three points.
So this is the second time we do it.
This time, this point was left out.
These guys were the training.
I connected them and computed the error.
Third one. You can see the pattern.
After I am done, I'm going to compute the cross-validation error to
be simply the average of the three errors.
So let's say we are using squared errors.
e_1 is the squared of this distance, et cetera, and you are
adding them up, one third.
This will be the cross-validation error.
What I am saying now is that you are going to take this as
an indication for how well the linear model fits the data, out-of-sample.
If you look in-sample, obviously it fits the data perfectly.
And if you use the three points, the line will be something like that.
It will fit it pretty decently.
But you have no way to tell how you are going to perform out-of-sample.
Here, we created a mini out-of-sample, in each case, and we took the average
performance of those as an indication of what will happen out-of-sample.
Mind you, we are using only 2 points here.
And when we are done, we are going to use it on 3 points.
That's g minus versus g.
It's a little bit dramatic here, because 2 and 3--
the difference is 1, but the ratio is huge.
But think of 99 versus 100.
Who cares?
It's close enough.
This is just for illustration.
So let's use this for model selection.
We did the linear model, and we call it linear.
So now let's go for the usual suspect, the constant model, exactly with the
same data set.
Let's look at the first guy.
These are the two points left out, the two points left out, and this is the
one for validation.
You train on those.
Here, you connected the points. Here,
you take the middle number, the average--
it's a constant.
And this would be your error here. Right?
Second guy, you get the idea? Third guy.
Now, if your question is: is the linear model better
than the constant model in this case?
then the only thing you look at in all of this is the cross-validation error.
So this guy, this guy, this guy, averaged, is the grade--
negative grade, because it's error-- for the linear model.
This guy, this guy, this guy, averaged, is that grade for the constant model.
And as you see, the constant model wins.
And it's a matter of record that these three points were actually generated
by a constant model.
Of course, they could have been generated by anything.
But on average, they will give you the correct decision.
And they avoid a lot of funny heuristics that you can apply.
You can say-- wait a minute, linear model, OK.
Any two points I pick, the slope here is positive.
So there is a very strong indication that there is a positive slope
involved, and maybe it's a linear model with a positive slope.
Don't go there.
You can fool yourself into any pattern you want.
Go about it in a systematic way.
This is a quantity we know, the cross-validation error.
This is the way to compute it.
We are going to take it as the indication, notwithstanding that there
is an error bar because it's a small sample, in this case 3,
and also because we are making the decision for 2 points, and we are
using it for 3 points.
These are obviously inherent, but at least it gives you something
systematic.
And indeed, it gives you the correct choice in this case.
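The whole three-point comparison can be reproduced in a few lines. This is a sketch with made-up coordinates (the lecture's actual points are not given numerically), using squared error; degree 0 is the constant model and degree 1 is the linear model.

```python
import numpy as np

def loocv_error(X, y, degree):
    # "Leave one out" cross-validation error for a polynomial model:
    # for each n, train on D_n (all points but n), then evaluate e_n
    # on the single point left out, and average.
    errs = []
    for n in range(len(X)):
        mask = np.arange(len(X)) != n
        w = np.polyfit(X[mask], y[mask], degree)
        errs.append((np.polyval(w, X[n]) - y[n]) ** 2)
    return float(np.mean(errs))

# Three illustrative points, roughly a constant target plus noise
# (these coordinates are assumptions, not the lecture's data).
X = np.array([-1.0, 0.0, 1.0])
y = np.array([0.6, 0.1, 0.9])

E_cv_linear = loocv_error(X, y, degree=1)   # linear model
E_cv_const = loocv_error(X, y, degree=0)    # constant model
```

With these points, the constant model gets the smaller cross-validation error and wins the comparison, matching the lecture's conclusion.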
So let's look at cross-validation in action.
I'm going to go with a familiar case.
You remember this one?
Oh, these were the handwritten digits, and we
extracted two features,
symmetry and intensity.
And we are plotting the different guys, and we would like to find
a separating surface.
We are going to use a nonlinear transform, as we always do.
And in this case, what I'm going to do, I'm going to sample 500 points
from this set at random for training, and use the rest for testing the
hypothesis.
What is the nonlinear transformation?
It's huge--
5th order.
So I am going to take all 20 features, or 21
including the constant.
And what am I going to use validation for?
This is the interesting part.
What I'm going to use validation for is, where do I cut off?
So I'm comparing 20 models.
The first model is, just take this guy.
Second model is, take x_1 and x_2.
Third model, take x_1, x_2, and x_1 squared, et cetera.
Each of them is a model.
I can definitely train on it and see what happens.
And I'm going to use cross-validation "leave one out", in order to choose
where to stop.
So if I have 500 examples, realize that every time I do this, I have to
have 500 training sessions.
Each training session has 499 points.
It's quite an elaborate thing.
But when you do this, this is the curve you get.
You get different errors.
Let me magnify it.
This is the number of features used.
This is the cutoff I talked about.
You can go all the way up to 20 features.
When you look at the training error, not surprisingly, the training error
always goes down.
What else is new?
You have more, you fit better.
The out-of-sample error, which I'm evaluating on the points that were not
involved at all in this process, cross-validation or otherwise, just out of
sample totally, I get this fellow.
And the cross-validation error, which I get from the 500 examples by
excluding one point at a time and taking the average, is remarkably
similar to E_out.
It tracks it very nicely.
And if I use it as a criterion for model choice, the minima are here.
So if I take between 5 and 7, let's say I take 6.
I would say, let me cut off at 6 and see what the performance is like.
Let's look at the result of that,
without validation, and with validation.
Without validation, I'm using the full model, all 20.
And you can see, we have seen this before-- overfitting.
I'm sweating bullets to include this single point in the middle, and after
I included it, guess what?
None of the out-of-sample points was red here.
This was just an anomaly.
So I didn't get anything for it. This is a typical thing.
It's unregularized.
Now, when you use the validation, and you stop at the 6th because the
cross-validation error told you so, it's a nice, smooth surface.
It's not perfect error, but it didn't put an effort where it didn't belong.
And when you look at the bottom line, what is the in-sample error here?
0%.
You got it perfect.
We know that.
And the out-of-sample sample error?
2.5%.
For digits, that's OK.
OK, but not great.
Here, we went.
And now the in-sample error is 0.8%.
But we know better.
We don't care about the in-sample error going to 0.
That's actually harmful in some cases.
The out-of-sample error is 1.5%.
Now, if you are in the range-- 2.5% means that you are
performing 97.5%.
Here, you are performing 98.5%.
40% improvement in that range is a lot.
There is a limit here that you cannot exceed.
So here, you are really doing great by just doing that simple thing.
Now you can see why validation is considered, in this context, as
similar to regularization.
It does the same thing.
It prevented overfitting, but it prevented overfitting by estimating
the out-of-sample error, rather than estimating something else.
Now, let me go and very quickly--
and I will close the lecture with it--
give you the more general form.
We talked about "leave one out". Seldom you use "leave one out"
in real problems, and you can think of why.
Because if I give you 100,000 data points, and you want to leave one out,
you are going to have 100,000 sessions training on 99,999 for each, and you
will be an old person before the results are out.
So when you have "leave one out", you have N training sessions
using N minus 1 points each, right?
Now, let's consider to take more points for validation.
1 point makes it great, because N minus 1 is so close to N,
that my g minus will be so close to g.
But hey, 100,000, if you decided to take 100,000 minus 1,000,
that's still 99,000.
That's fairly close to 100,000.
You don't have to make it difference 1.
So what you do is, you take your data set, and you break it into
a number of folds.
Let's say 10-fold.
So this will be 10-fold cross-validation.
And each time, you take one of the guys here, that is, 1/10 in this
case, use it for validation, and the 9/10, you use them for training.
And you change, from one run to another, which one you take for validation.
So "leave one out" is exactly the same, except that here, the 10,
replace it by N. I break the thing into 1 example at a time, and then I
validate on 1 example.
Here, I'm taking a chunk.
And therefore, you have fewer training sessions,
in this case 10 training sessions,
with not that much of a difference, in terms of the number of examples.
If N is big, instead of taking 1, you take a few more.
Now, the reason I introduced this is because this is what I actually
recommend to you.
Very specifically, 10-fold cross-validation works
very nicely in practice.
So the rule is, you take the total number of examples, divide them by 10,
and that is the size of your validation set.
You repeat it 10 times, and you get an estimate, and you are ready to go.
That's it.
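As a sketch of that recommended regime, here is generic 10-fold cross-validation on synthetic data. The polynomial model, target function, and data sizes are assumptions for illustration, not the digits experiment from the lecture.

```python
import numpy as np

def kfold_cv_error(X, y, degree, folds=10, seed=0):
    # K-fold cross-validation: break the data into `folds` chunks,
    # validate on each chunk in turn while training on the rest,
    # and average the validation errors over the folds.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # random assignment to folds
    chunks = np.array_split(idx, folds)
    errs = []
    for k in range(folds):
        val = chunks[k]                      # 1/10 of the data for validation
        train = np.concatenate([chunks[j] for j in range(folds) if j != k])
        w = np.polyfit(X[train], y[train], degree)
        errs.append(np.mean((np.polyval(w, X[val]) - y[val]) ** 2))
    return float(np.mean(errs))              # average over the 10 sessions

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, 100)
y = np.sin(np.pi * X) + 0.2 * rng.standard_normal(100)
E_cv = kfold_cv_error(X, y, degree=3, folds=10)
```

With 100 points and 10 folds, each training session uses 90 points, so each g minus stays close to the final g, while all 100 points contribute to the estimate.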
I will stop here, and we'll take questions after a short break.
Let's start the Q&A. And we have an in-house question.
STUDENT: You told about validation, and you told that we
should restrict ourselves in amount of parameters we should estimate.
Do we have a rule of thumb about the number of these parameters?
So is, say, K over 20 parameters reasonable for the maximum number?
PROFESSOR: It obviously depends on the number of data points.
So the reason why I didn't give a rule of thumb in this case, because it goes
with the number of points.
But let's say that if I have 100 points for validation, so it's a small
data set, I would say that a couple of parameters would be fine.
At least, that's my own experience.
And you can afford more, when you have more.
And when you have more, you can even afford more than one validation set,
in which case, you use each of them for a different estimate.
But the simplest thing, I would say, a couple of parameters for 100 points
would be OK.
MODERATOR: Can you clarify why model choice by validation doesn't count as
data snooping?
PROFESSOR: For the same reason that the answer is usually given for
a question like that, because it is accounted for.
I took the validation set-- the validation points are patently out of
sample-- and I used them to make a choice.
And when I made that choice, I made sure that the discrepancy between
in-sample and out-of-sample on the validation set is very small.
So we had this discussion of how much bias there is, and we want to make
sure that the discrepancy is very little.
So because I have already done the accounting, I can take it as
a reasonable estimate for the out-of-sample.
That is why.
In the other case, the problem with the data snooping that I gave is that
you use the data in order to make choices, and in
that case, huge choices.
You looked at the data and you chose between different models, and you
didn't pay for it.
You didn't account for it.
That's where the problem was.
MODERATOR: Some people recommend using cross-validation 10 times.
What does that add?
PROFESSOR: The regime I described, I only need to tell you
10-fold, 12-fold, 50-fold, and then the rest is fixed.
So if I use 10-fold, then by definition I'm going
to do this 10 times.
It's not a choice, given the regime that I described. In each
run, I am choosing one of the 10 to be my validation, and the rest for
training, and taking the average.
So the question is asking, do I do this 10 times?
Inherently, built in the method is that you use it 10 times,
if that's the question.
MODERATOR: I think the question is, since you chose your 10 subsets
inside and then ran cross-validation,
what if you do it again, choosing 10 different subsets, and repeat that process?
PROFESSOR: There are variations.
For example, even, let's say, with the "leave one out", maybe I can take
a point at random, and not necessarily insist on going through all the
examples-- do it like 50 times, and take the average.
Or I can take subsets, like in the 10-fold, but I take random subsets
and stop at some point.
So there are variations of those.
The ones I described are the most standard ones.
But there are obviously variations.
And one can do an analysis for them as well.
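One of the variations just mentioned-- taking random subsets rather than fixed folds, and stopping after some number of repeats-- is commonly called repeated random subsampling, or Monte Carlo cross-validation. A minimal sketch; the function name and the parameters shown are my own choices, not part of the lecture:

```python
import random

def monte_carlo_cv(n, val_size, repeats, seed=0):
    """Repeated random subsampling: each repeat draws a fresh random
    validation set of val_size points instead of cycling through
    fixed folds, and you stop after a chosen number of repeats."""
    rng = random.Random(seed)
    for _ in range(repeats):
        val = rng.sample(range(n), val_size)
        chosen = set(val)
        train = [i for i in range(n) if i not in chosen]
        yield train, val

# For example: 50 repeats, validating on 10 of 100 points each time.
runs = list(monte_carlo_cv(100, 10, 50))
```

Unlike k-fold, an example may land in several validation sets or in none, so the averaged estimate has slightly different correlations among its terms.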
MODERATOR: Is there any rule for separating data among training,
validation, and test?
PROFESSOR: Random is the only trustworthy thing.
Because if you use your judgment somehow, you may introduce a sampling
bias, which we'll talk about in a later lecture.
And the best way to avoid that for sure, if you sort of flip coins to
choose your examples, then you know that you are safe.
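A coin-flip split of the kind just described can be done by shuffling the indices once and cutting. A sketch under my own assumptions-- the 60/20/20 fractions below are illustrative defaults, not a recommendation from the lecture:

```python
import random

def random_split(n, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle the indices once ("flip coins"), then cut into
    training, validation, and test sets; no human judgment enters,
    so the split itself introduces no sampling bias."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = random_split(100)
```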
MODERATOR: What's the computational complexity of adding
a cross-validation?
PROFESSOR: I didn't give the formula for it.
Basically, for "leave one out", you are doing N times as much
training as you did before.
The evaluation is trivial. Most of the time goes for the training.
So you can ask yourself, how many training sessions do I have to do now
that I'm using cross-validation, versus what I had to do before?
Before, you had to do one session.
Here, you have to do as many sessions as there are folds.
So 10-fold will be 10 times.
"Leave one out" would be N, because it's really N-fold, if
you want, and so on.
MODERATOR: A clarification--
can you use both regularization and cross-validation?
PROFESSOR: Absolutely.
In fact, one of the biggest utilities for validation is to choose
the regularization parameter.
So inherently in those cases, you do it.
You can use it to choose the regularization parameter.
And then you can also use it on the side, to do something else.
So both of them are active in the same problem.
And in most of the practical cases you will encounter, you will actually be
using both.
Very seldom can you get away without regularization, and very seldom can
you get away without validation.
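As an illustration of using cross-validation to choose the regularization parameter, here is a toy one-dimensional weight-decay (ridge) fit, where the regularized minimizer has a closed form. The data, the grid of lambda values, and all the names are my own; this is a sketch of the idea, not the lecture's implementation:

```python
import random

def ridge_1d(xs, ys, lam):
    """Closed-form 1-D weight-decay fit: minimize sum (y - w*x)^2 + lam*w^2,
    which gives w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=10):
    """10-fold cross-validation error of the ridge fit for a given lambda."""
    n = len(xs)
    idx = list(range(n))
    random.Random(0).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for i in range(k):
        val = set(folds[i])
        tr = [j for j in idx if j not in val]
        w = ridge_1d([xs[j] for j in tr], [ys[j] for j in tr], lam)
        total += sum((ys[j] - w * xs[j]) ** 2 for j in folds[i])
    return total / n

# Toy data: y = 2x + noise.  Pick the lambda with the smallest CV error.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]
grid = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: cv_error(xs, ys, lam))
```

The same machinery serves double duty: the cross-validation error selects lambda, and regularization then shapes the final fit-- the two techniques active in one problem, as described above.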
MODERATOR: Someone is asking that, this seems to be a brute force method for
model selection.
Is there a way to branch and bound how many hypotheses to consider?
PROFESSOR: There are lots of methods for model selection.
This is the only one, at least among the major ones, which does not require
assumptions.
I can do model selection based on, I know my target function is
symmetric, so I'm going to choose a symmetric model.
That can be considered model selection.
And there are a bunch of other logical methods to choose the model.
The great thing about validation is that there are no assumptions
whatsoever.
You have M models.
What are the models? What assumptions do they have?
How close are they, or not, to the target function--
Who cares?
They are M models.
I am going to take a validation set, and I'm going to find this objective
criterion, which is a validation or cross-validation error, and I'm going
to use it to choose.
So it's extremely simple to implement, and very immune to assumptions.
Obviously, if you make assumptions and you know that the assumptions are
valid, then you would be doing better than I am doing.
But then you know that the assumptions are valid.
I'm taking the case where I don't want to make assumptions that I don't know
hold, and still want to make the model selection.
MODERATOR: In the case where the data depends on time evolution,
how can validation update the model?
Is it used for that, or not?
PROFESSOR: Validation makes a principled choice, regardless of the
nature of that choice.
Let's say that I have a time series, and one of the things in time series--
let's say they're for financial forecasting--
is that, you can train, and then you get a system, and then the world
is not stationary.
So a system that used to work, doesn't work anymore.
You can make choices about, let's say I have a bunch of models, and I
want to know which one of them works at a particular time, given some
conditions.
You can make the model selection based on validation, and then you take
that model and apply it to the real data, or there are a bunch of things
you can do.
But in terms of tracking the evolution of systems, again, if you translate
the problem into making a choice, then you are ready to go with validation.
So the answer is yes.
And the method is to spell the problem out as a choice.
MODERATOR: Another clarification--
so with cross-validation, there's still some bias.
Can you quantify why it is better than just regular validation?
PROFESSOR: Both validation and cross-validation will have bias for
the same reasons.
The only question is the reliability of the estimate.
Let's say that I use "leave one out", so here's E_out.
And the bias aside, if I use "leave one out", I'm using all N of
the examples eventually, when I average them.
So the error bar is small.
Granted, it's not as small as it would be if the N errors were independent of
each other.
But it's fairly close to being as if they were independent.
So I get that estimate.
Therefore, anytime you have this estimate, it becomes less
vulnerable to bias.
Because if I have this small play, and I'm pulling down, I'm not going to pull
down too far, because I'm still within the error bar.
If I have the other estimate, which is swinging completely, it's very easy to
pull it down, and I get a worse effect of the bias.
So whenever you minimize the error bar, you minimize the vulnerability to
bias as well.
That's the only thing that cross-validation does.
It allows you to use a lot of examples to validate, while using a lot of
examples to train.
That's the key.
MODERATOR: Going back to the previous lecture, a question on that.
Can you see the augmented error as conceptually the same as a low-pass
filtered version of the initial error, or not?
PROFESSOR: It can be translated to that under the condition
that the regularizer is a smoothness regularizer, because that's what
low-pass filters do.
So as an intuition, it's not a bad thing to consider
in the case of something like weight decay. It's not going to be strictly
low-pass as in working in the Fourier domain and cutting off, et cetera.
But it will have the same effect of being smooth.
If you have a question,
please step to the microphone, and you can ask it.
So there's a question in house.
STUDENT: Yes.
It seems that cross-validation is a method to deal with limited size of
the data set.
So is it possible in practice that we have a data set so large that
cross-validation is not needed or not beneficial, or do people do it all the
time in principle?
PROFESSOR: It is possible, and one of the cases is the Netflix case,
where you had 100 million points.
So you think at this point, nobody will care about cross-validation.
But it turned out that even in this case, the 100 million points had only
a very small subset that came from the same distribution as the out-of-sample data.
So the 100 million--
again, it's the same question as the time evolution.
You have people making ratings, and different people making different
numbers of ratings, and this changes for a number of reasons.
Even for the same user,
after you rate for a while, you tend to change from your initial ratings.
Maybe you are initially excited or something.
So there are lots of considerations like that.
So eventually, the number of points that were patently coming from the
same distribution as the out-of-sample was much smaller than 100 million.
And these are the ones that were used to make big decisions,
like validation decisions.
And in that case, even if we started with 100 million, it might be a good
idea to use cross-validation at the end.
And if you use something like 10-fold cross-validation, then it's not that
big a deal, because you are just multiplying the effort by 10, which
is, given what is involved, not that big a deal.
And you really get a dividend in performance.
And if you insist on performance, then it becomes indicated.
So the answer is yes, because it doesn't cost that much, and because
sometimes in a big data set, the relevant part, or the most relevant
part, is smaller than the whole set.
MODERATOR: Say there's a scenario where you find your model through
cross-validation, and then you test the out-of-sample error.
But somehow you test a different model, and it gives you a smaller
out-of-sample error.
Should you still keep the one you found through cross-validation?
PROFESSOR: So I went through this learning and
came up with a model.
Someone else went through whatever exercise they have and came up with
a final hypothesis in this case.
And I am declaring mine the winner because of cross-validation, and now
we are saying that there's further statistical evidence.
We get an out-of-sample error that tells me that mine is not as good as
the other one.
Then it really is the question of, I have two samples, and I'm doing
an evaluation.
And one of them tells me something, and the other one tells me the other.
So I need to consider first the size of them. That will give me the
relative size of the error bar. And correlations, if any. And bias, which
cross-validation may have, whereas the other one, if it's truly out of
sample, does not.
If I go through the math, and maybe the math won't go through--
it's not always the case--
I will get an indication about which one I would favor.
But basically, it's purely a statistical question at this point.
MODERATOR: When there are few points, and cross-validation is going to be
done, is it a good idea to re-sample to enlarge the current
sample, or not really?
PROFESSOR: So I have a small data set.
That's the premise?
And I'm doing cross-validation.
So what is the--
MODERATOR: So the problem is, since you have few samples,
do you want to re-sample?
PROFESSOR: So instead of breaking them into chunks,
keep taking at random?
Well, I don't have from my experience something that would indicate that one
would win over the other.
And I suspect that if you are close to 10-fold, you probably are close to the
best performance you can get with variations of these methods.
And the problem is that all of these things are not completely pinned down
mathematically.
There is a heuristic part of it, because even cross-validation, we
don't know what the correlation is, et cetera.
So we cannot definitively answer the question of which one is better.
It's a question of trying in a number of problems, after getting the
theoretical guidelines, and then choosing something.
What is being reported here is that the 10-fold cross-validation stood the
test of time.
That's the statement.
MODERATOR: When there is a big class size imbalance, does cross-validation
become a problem?
PROFESSOR: When there is an imbalance between the classes-- that is,
a bunch of +1's and fewer -1's-- there are certain things that need
to be taken into consideration, in order to make learning go through
well-- in order to basically avoid the learning algorithm going for the
all +1 solution, because it's a very attractive one.
So there are a bunch of things that can be taken into consideration, and I
can see a possible role for cross-validation.
But it's not a strong component as far as I can see.
The question of balancing them, making sure that you avoid the all-constant
solution, and things like that, will probably play a bigger role.
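One concrete way cross-validation can be adapted to imbalanced classes-- not something prescribed in the lecture, just a common practice I'm adding for illustration-- is to stratify the folds so each validation chunk preserves the class proportions:

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Build k folds that preserve each class's proportion, so even a
    rare class shows up in every validation chunk."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)   # deal each class round-robin across folds
    return folds

# 90 examples of +1 and 10 of -1: each of the 10 folds gets one -1.
labels = [+1] * 90 + [-1] * 10
folds = stratified_folds(labels, k=10)
```

Without stratification, a purely random fold of a 90/10 data set can easily contain no minority examples at all, which makes the per-fold validation error misleading.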
MODERATOR: How does the bias behave when we increase the number of points
that we leave out?
That is, the size t, if we leave t points out.
PROFESSOR: The points we leave out are the validation points.
And if we are using the 10-fold or 12-fold, et cetera, the total number
that go into the summation will be constant, because in spite of the fact
that we're taking different numbers, we go through all of them,
and we add them up.
So that number doesn't change.
MODERATOR: So how does it change, if instead of doing
10-fold, you use 20-fold?
How does that--
PROFESSOR: How does it change?
It doesn't change the number of total points going into the estimate
of cross-validation.
But what was the original question?
MODERATOR: So how does the bias behave?
PROFESSOR: Oh.
Well, given that the total number will give you the error bar, and given that
the bias is really a function of how you use it, rather than something
inherent in the estimate, the error bar will give you an indication of how
vulnerable you are to bias.
Say that, if you take two scenarios where the error bar is comparable, you
have no reason to think that one of them will be more vulnerable to bias
or another.
Now, you would need a very detailed analysis to see the difference between
taking one point at a time, trained on N minus 1, considering the
correlations, versus taking 1/10 at a time and adding them up-- to find
out what the correlation is, what the effective number of examples is,
and therefore what the error bar is.
In any given situation, that would be a pretty heavy task to do.
So basically, the answer is that as long as you do a number of
folds, and every example appears in the cross-validation
estimate exactly once, then there is no preference between them as far as
the bias is concerned.
MODERATOR: I think that's it.
PROFESSOR: Very good.
We'll see you on Thursday.