ANNOUNCER: The following program is brought to you by Caltech.
YASER ABU-MOSTAFA: Welcome back.
Last time, we talked about regularization, which is a very
important technique in machine learning.
And the main analytic step that we took was to go from a constrained form of
regularization, where you explicitly forbid some of the hypotheses from
being considered, thereby reducing the VC dimension and improving the
generalization property, to an unconstrained version which creates
an augmented error, in which no particular vector of weights is
prohibited per se, but basically you have a preference among weights based on
a penalty that has to do with the constraint.
And that equivalence will make us focus on the augmented-error form of
regularization in almost every practical case.
And the argument for it was to take the constrained version and look at it,
either as a Lagrangian which would be the formal way of solving it, or as we
did it in a geometric way, to find a condition that corresponds to
minimization under a constraint, and find that this would be locally
equivalent to minimizing this in an unconstrained way.
Then we went to the general form of a regularizer, and we called it
Omega of h.
And it depends on small h rather than capital H, which was the other
Omega that we used in the VC analysis.
And in that case, we formed the augmented error as the in-sample
error, plus this term.
And the idea now is that the augmented error will be a better thing
to minimize, if you want to minimize the out-of-sample error, rather than
that-- just minimizing E_in by itself.
And there are two choices here.
One of them is the regularizer
Omega, weight decay or weight elimination, or the other
forms you may find.
And the other one is lambda, which is the regularization parameter-- the
amount of regularization you're going to put.
And the long and short of it is that the choice of Omega in a practical
situation is really a heuristic choice, guided by theory and guided by
certain goals, but there is no mathematical way in a given practical
situation to come up with a totally principled Omega.
But we follow the guidelines, and we do quite well.
So we make a choice of Omega towards smoother or simpler hypotheses.
And then we leave the amount of regularization to the determination of
lambda, and lambda is a little bit more principled.
We'll find out that we will determine lambda using validation, which is the
subject of today's lecture.
And when you do that, you will get some benefit of Omega.
If you choose a great Omega, you will get a great benefit.
If you choose an OK Omega, you will get some benefit.
If you choose a terrible Omega, you are still safe, because lambda will
tell you-- the validation will tell you-- just
take lambda equal to 0, and therefore no harm done.
And as you see, the choice of lambda is indeed critical, because when you
take the correct amount of lambda, which happens to be very small in this
case, the fit, which is the red curve, is very close to the target,
which is the blue.
Whereas if you push your luck, and have more of the regularization,
you end up constraining the fit so much that the red-- it wants to move
toward the blue, but it can't because of the penalty, and ends up being
a poor fit for the blue curve.
So that leads us to today's lecture, which is about validation.
Validation is another technique that you will be using in almost every
machine learning problem you will encounter.
And the outline is very simple.
First, I'm going to talk about the validation set.
There are two aspects that I'm going to talk about.
The size of the validation set is critical.
So we'll spend some time looking at the size of the validation set.
And then we'll ask ourselves, why did we call it validation
in the first place?
It looks exactly like the test set that we looked at before.
So why do we call it validation?
And the distinction will be pretty important.
And then we'll go for model selection, a very important
subject in machine learning.
And it is the main task of validation.
That's what you use validation for.
And we'll find that model selection covers more territory than what the
name may suggest to you.
Finally, we will go to cross-validation, which is a type of
validation that is very interesting, that allows you, if I give you
a budget of N examples, to basically use all of them for validation, and all of
them for training, which looks like cheating, because validation will look
like a distinct activity from training, as we will see.
But with this trick, you will be able to find a way to go around that.
Now, let me contrast validation with regularization, as far as control
of overfitting is concerned.
We have seen, in one form or another, the following by-now-famous equation,
or inequality or rule, where the out-of-sample error that you want is at
most the in-sample error plus some penalty.
Could be penalty for model complexity, overfit complexity, a bunch of other
ways of describing that.
But basically, this tells us that E_in is not exactly E_out.
That, we know all too well.
And there is a discrepancy, and the discrepancy has to do with the
complexity of something.
An overfit penalty has to do with the complexity of the model you are using
to fit, and so on.
So in terms of this equation, I'd like to pose both regularization and
validation as an activity that deals with this equation.
So what about regularization?
We put the equation.
What did regularization do?
It tried to estimate this penalty.
Basically, what we did is concoct a term that we think captures the
overfit penalty.
And then, instead of minimizing the in-sample, we minimize the in-sample
plus that, and we call that the augmented error.
And hopefully, the augmented error will be a better proxy for E_out.
That was the deal.
And we notice that we are very, very inaccurate in the choice here.
We just say, smooth, pick lambda, you can use this, you can use that.
So obviously, we are not satisfying any equality by any chance.
But we are basically getting a quantity that has a monotonic
property, that when you minimize this, this gets minimized, which does the
job for us.
Now, to contrast this, let's look at validation, when it's dealing with the
same equation.
What does validation do?
Well, validation cuts to the chase.
It just estimates the out-of-sample.
Why bother with this analysis, and overfit, and this and that?
You want to minimize the out-of-sample?
Let's estimate the out-of-sample, and minimize it.
Obviously, it's too good to be true, but it's not totally untrue.
Validation does achieve something in that direction.
So let me spend a few slides just describing the estimate.
I'm trying to estimate the out-of-sample error.
This is not completely a foreign idea to us, because we use a test set in
order to do that.
So let's focus on this, and see what are the parameters involved in
estimating the out-of-sample error.
Let's look at the estimate.
The starting point is to take an out-of-sample point x, y.
This is a point that was not involved in training.
We used to call it test point.
Now we are going to call it validation point.
It will not become clear for a while why we are giving it a different name,
until we use the validation set for something, and then the
distinction will become clear.
But as far as you are concerned now, this is just a test point.
We are estimating E_out, and we will just read the value of E_out and be
happy with that, and not do anything further.
So you take this point, and the error on it is the difference between what
your hypothesis does on x, and what the target value is, which is y.
And what is the error?
We have seen many forms of the error.
Let's just mention two to make it concrete.
This could be a simple squared error.
We have seen that in linear regression.
It could be the binary error.
We have seen that in classification.
So nothing foreign here.
Now, if you take this quantity, and we are now treating it as
an estimate for E_out,
a poor estimate, but nonetheless an estimate.
We call it an estimate because, if you take the expected value of that with
respect to the choice of x, with the probability distribution over the
input space that generates x, what will that value be?
Well, that is simply E_out.
So indeed, this quantity, the random variable here, has the correct
expected value.
It's an unbiased estimate of E_out.
But unbiased means that it's as likely to be here or here, in terms of
expected value.
But we could be this, and this would be a good estimate, or we could be
this, and this would be a terrible estimate.
Because you are not getting all of them.
You are just getting one of them.
So if this guy swings very large, and I tell you this is an estimate of
E_out, and you get it here, this is what you will think E_out is.
So there is an error, but the error is not biased.
That's what this equation says.
But we have to evaluate that swing, and the swing is obviously evaluated
by the usual quantity, the variance.
And let's just call the variance sigma squared.
It depends on a number of things, including what is your error measure
and whatnot, but that is what a single point does.
So you get an estimate, but the estimate is poor because it's one
point, and therefore sigma squared is likely to be large.
So you are unlikely to use the estimate on one point as your guide to
E_out.
What do you use?
You move from one point, to a full set.
So you get what?
You get a validation set that you are going to use to estimate E_out.
Now, the notation we are going to have is that the number of points in the
validation set is K. Remember that the number of points in the
training set was N.
So this will be K points, also generated according to the same rules--
independently, according to the probability distribution
over the input space.
And the error on that set we are going to call E_val, as
in validation error.
So we have E_in, and we have E_out.
Now we are introducing another one, E_val, the validation error.
And the form for it is what you expect it to be.
You take the individual errors on the examples, and you take the average,
like you did with the training set, and this one is the validation error.
The only difference is that this is done out-of-sample.
These guys were not used in training, and therefore you would expect that
this would be a good estimate for the out-of-sample performance.
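The quantity just described can be sketched in a few lines. This is an illustrative implementation, not from the lecture; the function and variable names are ours. E_val is simply the average of the point-wise errors of the hypothesis g over the K held-out validation points, using either of the two error measures mentioned.

```python
import numpy as np

def squared_error(y_pred, y_true):
    # Squared error, as in linear regression
    return (y_pred - y_true) ** 2

def binary_error(y_pred, y_true):
    # Binary error, as in classification (agreement of signs)
    return (np.sign(y_pred) != np.sign(y_true)).astype(float)

def validation_error(g, X_val, y_val, pointwise_error):
    # E_val = (1/K) * sum over the K validation points of e(g(x_k), y_k)
    predictions = np.array([g(x) for x in X_val])
    return float(np.mean(pointwise_error(predictions, y_val)))
```

The only requirement, as the lecture stresses, is that the K points passed in here were not used in training.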
Let's see if it is.
What is the expected value of E_val, the validation error?
Well, you take the expected value of this fellow.
The expectation goes inside.
So the main component is the expected value of this fellow, which we have
seen before-- expected value on a single point.
And you just average linearly, as you did.
Now, this quantity happens to be E_out.
The expected value on one point is E_out.
Therefore, when you do that, you just get E_out again.
So indeed, again, the validation error is an unbiased estimate of the out-of-sample
error, provided that all you did with the validation set is just measure the
out-of-sample error.
You didn't use it in any way.
Now, let's look at the variance, because that was our problem with the
single-point estimate, and let's see if there's an improvement.
When you get the variance, you are going to take this formula.
And then you are going to have a double summation, and have all cross
terms of e between different points.
So you will have the covariance between the value for k equals 1 and k
equals 2, k equals 1 and k equals 3, et cetera.
And you also have the diagonal guys, which are the variances in this case,
with k equals 1 and k equals 1 again.
The main components you are going to get are the variances, and a bunch of
covariances.
Actually, there are more covariances than variances, because the variances
are the diagonal, the covariances are the off-diagonal.
There are almost K squared of them.
The good thing about the covariance in this case is that it will be 0,
because we picked the points independently.
And therefore, the covariance between quantities that depend on different
points will be 0.
So I'm only stuck with the diagonal elements, which happen
to have this form.
I have the variance here.
And when I put the summation, something interesting happens.
I have the summation again, a double summation reduced to one,
because I'm only summing the diagonal.
But I still have the normalizing factor with the number of elements.
Because I had K squared elements, the fact that many of them dropped out is
just to my advantage.
I still have the 1 over K squared, and that gives me a better variance for
the estimate based on E_val than for the one based on a single point.
This is your typical analysis of adding a bunch
of independent estimates.
So you get the sigma squared.
That was the variance on a particular point.
But now you divide it by K. Now we see a hope, because even if the original
estimate was this way, maybe we can have K big enough that we keep
shrinking the error bar, such that the E_val itself as a random variable
becomes this, which is around E_out-- what we want.
And therefore, it becomes a reliable estimate.
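A quick simulation checks this sigma squared over K behavior. This is an illustrative sketch with names of our choosing: each "point error" is drawn Uniform(0,1), which has variance 1/12, standing in for sigma squared.

```python
import numpy as np

def empirical_var_of_e_val(K, n_sets=20000, seed=0):
    # Draw n_sets independent validation sets, each with K Uniform(0,1)
    # point errors; E_val is the mean of each row.  A single point has
    # variance sigma^2 = 1/12, so Var(E_val) should come out near (1/12)/K.
    rng = np.random.default_rng(seed)
    e_val = rng.random((n_sets, K)).mean(axis=1)
    return float(e_val.var())
```

Calling this with K = 10 and K = 100 shows the variance shrinking by roughly a factor of 10, matching the 1 over K prediction.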
This looks promising.
Now we can write the E_val, which is a random variable, to be E_out,
which is the value we want, plus or minus something that averages to 0, and
happens to be the order of approximately 1 over square root of K.
If the variance is 1 over K, then the standard deviation is 1 over
square root of K.
I'm assuming here that sigma squared is constant in the
range that I'm using.
And therefore, the dependency on K only comes from here.
Therefore, I have this quantity that tells me this is what I'm
estimating, and this is the error I'm committing, and this is how the error
is behaving as I increase the number K.
The interesting point now is that K is not free.
It's not like I tell you, it looks like if I increase K,
this is a good situation.
So why don't we use more and more validation points?
Because the reality is, K is not given to you on top of your training set.
What I give you is a data set, N points, and it's up to you to use how
many to train, and how many to validate.
I'm not going to give you more, just because you want to validate.
So every time you take a point for validation, you are taking it away
from training, so to speak.
Let's see the ramifications of this regime.
K is taken out of N. So let's now have the notation.
We are given a data set D, as we always called it, and it has
N points.
What do we do with it?
We are going to take K points, and use them for validation.
And you can take any K points, as long as you don't look at the particular
input and output.
Let's say you pick K points at random, from the N points.
That will be a valid validation set for you.
So I have the K points, and therefore I'm left with N minus K for training.
The ones I left for training, I'm going
to call D_train.
I didn't have to use that when I didn't have validation, because
D all went to training, so I didn't need to have the
distinction.
Now, because I have two utilities, I'm going to take the guys that go into
training and call that subset D_train.
And the guys that I hold for validation I'm going to
call it D_val.
The union of them is D. That's the setup.
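This setup is easy to sketch in code. The function below is illustrative (the name and signature are ours): it picks K of the N points at random for D_val, and the remaining N minus K become D_train.

```python
import numpy as np

def split_train_val(X, y, K, seed=0):
    # Pick K of the N points at random for validation, without looking
    # at the particular inputs and outputs; N - K remain for training.
    N = len(X)
    perm = np.random.default_rng(seed).permutation(N)
    val_idx, train_idx = perm[:K], perm[K:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx])
```

The union of the two pieces is D, exactly as in the setup above.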
Now, we looked in the previous slide at the reliability of the estimate of
the validation set.
And we found that this reliability, if we measure it by the error bar on the
fluctuation, it will be the order of 1 over square root of K, the
number of validation points.
Then our conclusion is that if you use small K, you have a bad estimate.
And the whole role we have for validation so far is estimate, so we
are not doing a good job.
So we need to increase K.
It looks like a good idea, just from that point of view, to take
large K.
But there are ramifications for taking large K, so we have a question mark.
And let's try to be quantitative about it.
Remember this fellow?
That was the learning curve.
What did it do?
It told us, as you increase the number of training points, what is the
expected value of E_out and what is the expected value of E_in, for a given
model, the model that I'm plotting the learning curves of. Right?
Now, the number of data points used to be N. Here I'm writing it as N minus
K. Why am I doing that?
Because under the regime of validation, this is what
I'm using for training.
Therefore if you increase K, you are moving in this direction, right?
I used to be here, and I used to expect that level of E_out.
Now I am here, and I'm expecting that level of E_out.
That doesn't look very promising.
I may get a reliable estimate, because I'm using bigger K, but I'm getting
a reliable estimate of a worse quantity.
If you want to take an extreme case, you are going to take this estimate and
go to your customer, and tell them what you expect for the performance to be.
So you don't only deliver the final hypothesis.
You deliver the final hypothesis, with an estimate for how it will do when
they test it on a point that you haven't seen before.
Now, we want the estimate to be very reliable, and you forget about the
quality of the hypothesis.
So you keep increasing K, keep increasing K, keep increasing K.
You end up with a very, very reliable estimate.
The problem is that it's an estimate of a very, very poor quantity, because
you used 2 examples to train, and you are basically in the noise.
So the statement you are going to make to your customer in this case is that,
here is a system.
I am very sure that it's terrible!
That is unlikely to please a customer.
So now, we realize that there is a price to be paid for K. It turns out
that we are going to have a trick that will make us not pay that price.
But still, the question of what happens when K is big is a question
mark in our mind.
What I'm going to do now is tell you: you used K points to
estimate the error.
Now, why don't you restore the data set after
you have the estimate? The estimate is now in your pocket, so train
on the full set, and you get the better guy.
Well, I estimated for the smaller guy.
What are we doing here?
Let's just do this systematically.
Let's put K back into the pot.
So here is the regime.
I'm going to describe this figure, but let's talk it piece by piece.
We have the data set D, right?
We separated it into D_train and D_val.
D itself has N points.
We took N minus K to train, K to validate.
That's the game.
What happened?
If we used the full training set to train, we would get a final hypothesis
that we called g.
This is just a matter of notation.
But under the regime of validation, you took out some guys.
And therefore, you are using only D_train to train.
And this has N minus K. Doesn't have all the examples.
Therefore, I am going to generically label the final hypothesis that I get
from training on a reduced set, D_train, I am going to call it g minus.
Just to remind ourselves that it's not on the full training set.
So now, here is the idea, if you look at the figure.
I have the D. Let me get it a bit smaller so that we can get the output.
If I use the training set by itself, I would get g.
What I am doing now is that I am going to take D_train, which has fewer
examples, and the rest go to validation.
I use D_train to get g minus, and then I take g minus and evaluate it on
D_val, the validation set, in order to get an estimate.
So the trick now is that instead of reporting g minus as my final
hypothesis, I know if I added the other data points here to the pot, I
am going to get a better out-of-sample.
I don't know what it is.
I don't have an estimate for it.
But I know it's going to be better than the one for g minus, simply
because of the learning curve.
On average, I get more examples, I get better out-of-sample error.
So I put it back and then report g.
So it's a funny situation.
I'm giving you g, and I'm giving you the validation estimate on g minus.
Why?
Because that's the only estimate I have.
I cannot give you the estimate on g, because now if I get g, I don't have
any guys to validate on.
So you can see now the compromise.
Under this scenario, I'm not really losing in performance by taking
a bigger validation set, because I'm going to put them back when I get
the final hypothesis.
What I am losing here is that, the validation error I'm reporting, is
a validation error on a different hypothesis than the
one I am giving you.
And if the difference is big, then my estimate is bad, because I'm
estimating on something other than what I am giving you.
And that's what happens when you have large K.
When you have large K, the discrepancy between g minus and g is bigger.
And I am giving you the estimate on g minus.
So that estimate is poor.
And therefore, I get a bad estimate again.
Now, you see the subtlety here.
This is the regime that is used in validation, universally.
After you do your thing, and you do your estimates, and, as you will see
further, you do your choices, you go and put all the examples to train on,
because this is your best bet of getting a good hypothesis.
If your K is small, the validation error is not reliable.
It's a bad estimate, just because the variance of it is big.
I have small K, it's 1 over square root of K, so I'm doing this.
If you get big K, the problem is not the reliability of the estimate.
The problem is that the thing you are estimating is getting further and
further away from the thing you are reporting.
So now we have a compromise.
We don't want K to be too small, in order not to have fluctuations.
We don't want K to be too big, in order not to be too far from
what we are reporting.
And as usual in machine learning, there is a rule of thumb.
And the rule of thumb is pretty simple.
That's why it's a rule of thumb.
It says, take one fifth for validation.
That usually gives you the best of both worlds.
Nothing proved.
You can find counterexamples.
I'm not going to argue with that.
It's a rule of thumb.
Use it in practice, and actually you will be quite successful here.
There's an argument with some people, whether it should be N
over 5 or N over 6.
I'm not going to fret over that.
It's a rule of thumb, after all, for crying out loud!
We'll just leave it at that.
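The whole regime described above, with the N over 5 rule of thumb, can be summarized in one short sketch. This is illustrative code, not the lecture's; `learner` stands for any training procedure that fits a hypothesis to data, and `point_error` is a point-wise error measure.

```python
import numpy as np

def train_with_validation(learner, point_error, X, y, seed=0):
    N = len(X)
    K = N // 5                                  # rule of thumb: one fifth for validation
    perm = np.random.default_rng(seed).permutation(N)
    val, tr = perm[:K], perm[K:]
    g_minus = learner(X[tr], y[tr])             # g minus: trained on N - K points
    # Unbiased estimate, but of g minus, not of the g we report
    e_val = float(np.mean(point_error(g_minus(X[val]), y[val])))
    g = learner(X, y)                           # restore the K points; report g
    return g, e_val
```

The funny situation from the previous slides is visible right in the return statement: the hypothesis delivered is g, while the estimate delivered was measured on g minus.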
So we now have that.
Let's go to the other aspect.
We know what validation is, and we understand how critical it is to
choose the number, and we have a rule of thumb.
Now let's ask the question, why are we calling this validation
in the first place?
So far, it's purely a test.
We get an out-of-sample point.
The estimate is unbiased.
What is the deal?
We call it validation, because we use it to make choices.
And this is a very important point, so let's talk about it in detail.
Once I make my estimate affect the learning process, the set I am using
is going to change nature.
So let's look at a situation that we have seen before.
Remember this fellow?
Yeah, this was early stopping in neural networks.
And let me magnify it for you to see the green curve.
Do you see the green curve now?
OK, so there is a green curve.
Now we'll scoot it back.
So the in-sample goes down.
Out-of-sample--
let's say that I have a general estimate for the out-of-sample--
goes down with it until such a point that it goes up, and we have the
overfitting, and we talked about it.
And in this case, it's a good idea to have early stopping.
Now, let's say that you are using K points, that you did not use for
training, in order to estimate E_out.
That would be E_test, the test error, if all you are doing is just plotting
the red in order to look at it and admire it.
Oh, that's a nice curve.
Oh, it's going up.
But you're not going to take any actions based on it.
Now, if you decide that, this is going up,
I had better stop here.
That changes the game dramatically.
All of a sudden, this is no longer a test error.
Now it's a validation error.
So you ask yourself, what the heck?
It's just semantics.
It's the same curve.
Why am I calling it a different name?
I'm calling it a different name, because it used to be unbiased.
That is, if this is an estimate of E_out, not the actual E_out, there will be
an error bar in estimating E_out.
But it is as likely to be optimistic, as pessimistic.
Now, when you do early stopping, you say, I'm going to stop here and I'm
going to use this value as my estimate for what you are getting.
I claim that your estimate is now biased.
It's the same point.
You told us it was unbiased before.
What is the deal?
Let's look at a very specific simple case, in order to
understand what happens.
This is no longer a test set.
It becomes, in red, a validation set.
Fine, fine.
Now convince us of the substance of it.
We know the name.
So let's look at the difference, when you actually make a choice.
Very simple thing that you can reason.
Let's say I have the test set, which is unbiased, and I'm claiming that the
validation set has an optimistic bias.
Optimism is good.
But here, it's optimism followed by disappointment.
It's deception.
We are just calling it optimistic, to understand that it's always in the
direction of thinking that the error will be smaller than it will, actually,
turn out to be.
So let's say we have two hypotheses.
And for simplicity, let's have them both have the same E_out.
So I have two hypotheses.
Each of them has out-of-sample error 0.5.
Now, I'm using a point to estimate that error.
And I have two estimates,
e_1 for the hypothesis 1, and e_2 for the hypothesis 2.
I'm going to use-- because the estimate has fluctuations in it, just
again for simplicity, I'm going to assume that both e_1 and e_2 are uniform
between 0 and 1.
So indeed, the expected value is half, which is the expected value I want,
which is the out-of-sample error.
Now, I'm not going to assume strictly that e_1 and e_2 are independent, but
you can assume they are independent for the sake of argument.
But they can have some level of correlation, and you'll still get the
same effect.
Let's think now that they are independent variables, e_1 and e_2.
Now, e_1 is an unbiased estimate of its out-of-sample error, right?
Right.
e_2 is the same, right?
Right.
Unbiased means the expected value is what it should be.
And the expected value, indeed in this case,
is what it should be, 0.5.
Now, let's take the game, where we pick one of the hypotheses,
either h_1 or h_2.
How are we going to pick it?
We are going to pick it according to the value of the error.
So now, the measurement we have is applying to that choice.
What I'm going to do, I'm going to pick the
smaller of e_1 and e_2.
And whichever that one is, I'm going to pick the hypothesis that
corresponds to it.
So this is mini learning.
You look at the errors, pick the smaller one, and that is the one you go with.
My question to you is very simple.
What is the expected value of e?
A naive thought would say, you told us the expected value of e_1 is 1/2.
You told us the expected value of e_2 is 1/2.
e has to be either e_1 or e_2.
So the expected value should be 1/2?
Of course not, because now the rules of the game-- the probabilities that
you're applying-- have changed, because you are deliberately picking the
minimum of the realization.
And it's very easy to see that the expected value of e is less than 0.5.
The easiest thing to say is that if I have two variables like that, the
probability that the minimum will be less than 1/2 is 75%, because all you
need is for one of them to be less than 1/2.
If the probability of being less than 1/2 is 75%, you expect the
expected value to be less than 1/2.
It's mostly there.
The mass is mostly below.
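The claim above is easy to verify numerically. In this illustrative simulation, e_1 and e_2 are each unbiased with mean 1/2, yet the minimum has P(e < 1/2) equal to 3/4 and, in fact, expected value 1/3.

```python
import numpy as np

# e_1, e_2 ~ Uniform(0,1): each unbiased (mean 1/2), but the chosen
# e = min(e_1, e_2) is optimistically biased.
rng = np.random.default_rng(0)
e1, e2 = rng.random(1_000_000), rng.random(1_000_000)
e = np.minimum(e1, e2)

mean_e1 = e1.mean()             # close to 0.5: unbiased before choosing
frac_below_half = (e < 0.5).mean()  # close to 0.75, as argued above
mean_e = e.mean()               # close to 1/3, well below 0.5
```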
So now you realize this is what?
This is an optimistic bias.
And that is exactly the same as what happened with the early stopping.
We picked the point because it's minimum on the realization, and that is what
we reported.
Because of that-- the thing used to be this,
but we wait.
When it's there, we ignore it.
When it's here, we take it.
So now that introduces a bias, and that bias is optimistic.
And that will be true for the validation set.
Our discussion so far is based on just looking at the E_out.
Now we're going to use it, and we're going to introduce a bias.
Fortunately for us, the use of validation in machine learning is so
light, that we are going to swallow the bias.
Bias is minor.
We are not going to push our luck.
We are not going to estimate tons of stuff, and keep adding bias until the
validation error basically becomes training error in disguise.
We're just going to-- let's choose a parameter, choose between models,
and whatnot.
And by and large, if you do that, and you have a respectable-size validation
set, you get a pretty reliable estimate for the E_out,
conceding that it's biased, but the bias is not going to hurt us too much.
So this is the general philosophy.
Now with this understanding, let's use validation set for model
selection, which is what validation sets do.
That is the main use of validation sets.
And the choice of lambda, in the case we saw, happens to be
a manifestation of this.
So let's talk about it.
Basically, we are going to use the validation set more than once.
That's how we are going to make the choice.
So let's look.
This is a diagram.
I'm going to build it up.
Let's build it up, and then I'll focus on it, and look at how the
diagram reflects the logic.
We have M models that we're going to choose from.
When I say model, you are thinking of one model versus another, but this is
really talking more generally.
I could be talking about models as in, should I use linear models or neural
networks or support vector machines?
These are models.
I could be using only polynomial models.
And I'm asking myself, should I go for 2nd order, 5th
order, or 10th order?
That's a choice between models.
I could be using 5th-order polynomials throughout, and the only
thing I'm choosing,
should I choose lambda of the regularization to be 0.01, 0.1, or 1?
All of this lies under model selection.
There's a choice to be made, and I want to make it in a principled way,
based on the out-of-sample error, because that's the bottom line.
And I'm going to use the validation set to do that.
This is the game.
So we'll call them, since they're models, I have H_1 up to H_M.
And we are going to use D to train, and I am going to get
as a result of that--
it's not the whole set, as usual, so I left some for validation.
And I'm going to get g minus.
That is our convention for whenever we train on something less
than the full set.
But because I'm getting a hypothesis from each model, I am labeling it
by the subscript m. So there is g_1 up to g_M, with a minus, because they
used D_train to train.
So I get one for each model.
And then I'm going to evaluate that fellow, using the validation set.
The validation set are the examples that were left out from D, when I took
the D_train.
So now I'm going to do this.
All I'm doing is exactly what I did before, except I'm doing it M
times, and introducing the notation that goes with that.
Let's look at the figure now a little bit.
Here is the situation.
I have the data set.
What do I do with it?
I break it into two,
validation and training.
I use the training to apply to each of these
hypothesis sets, H_1 up to H_M.
And when I train, I end up with a final hypothesis.
It is with a minus, a small minus in this case, because I'm
training on D_train.
And they correspond to the hypotheses they came from, so g_1, g_2, up to g_M.
These are done without any validation, just training
on a reduced set.
Once I get them, I'm going to evaluate their performance.
I'm going to evaluate their performance using the validation set.
So I take the validation set and run it here.
It's out-of-sample as far as they're concerned, because it's
not part of D_train.
And therefore, I'm going to get estimates-- these are the validation errors.
I'm just giving them a simple notation as E_1, E_2, up to E_M.
Now, your model selection is to look at these errors, which supposedly
reflect the out-of-sample performance if you use this as your final product,
and you pick the best.
Now that you are picking one of them, you immediately have alarm bells--
bias, bias, bias.
Something is happening now, because now we are going to be biased.
Each of these guys was an unbiased estimate of the out-of-sample error of
the corresponding hypothesis.
You pick the smallest of them, and now you have a bias.
So the smallest of them will give the index m star, whichever that might be.
So E_m star is the validation error on the model we selected, and now we
realize it has an optimistic bias.
And we are not going to take g_m star minus, which is the one that
gave rise to this.
We are now going to go back to the full data set, as
we said in our regime.
We are going to train with it.
And from that training, which is training now on the model we chose, we
are going to get the final hypothesis, which is g_m star.
So again, we are reporting the validation error of a reduced
hypothesis, if you will, but delivering the full hypothesis-- the best we can do,
because we know that we get better out-of-sample
when we add the examples.
So this is the regime.
Let's complete the slide.
E_m, that we introduced, happens to be the value of the validation error on
the reduced hypothesis g_m minus, as we discussed.
And this is true for all of them.
And then you pick the model m star, that happens to have the smallest E_m.
And that is the one that you are going to report, and you are going to
restore your D, as we did before, and this is what you have.
This is the algorithm for model selection.
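The algorithm just described can be sketched in code. This is a minimal sketch, assuming squared error and polynomial models as the M hypothesis sets; the target function, split sizes, and degrees are made-up illustrations, since the lecture does not fix a specific learning algorithm.

```python
import numpy as np

def fit_poly(X, y, degree):
    # Least-squares polynomial fit: returns a coefficient vector.
    return np.polyfit(X, y, degree)

def sq_err(w, X, y):
    # Mean squared error of the fitted polynomial on (X, y).
    return np.mean((np.polyval(w, X) - y) ** 2)

rng = np.random.default_rng(0)
N, K = 30, 10                      # total points, validation-set size
X = rng.uniform(-1, 1, N)
y = np.sin(np.pi * X) + 0.2 * rng.standard_normal(N)

# Split D into D_train (N - K points) and D_val (K points).
X_tr, y_tr = X[:-K], y[:-K]
X_val, y_val = X[-K:], y[-K:]

degrees = [1, 2, 3, 5, 8]          # the M models H_1 .. H_M

# Train each model on D_train to get g_m minus, then score it on D_val.
val_errors = [sq_err(fit_poly(X_tr, y_tr, d), X_val, y_val) for d in degrees]

# Pick the model m* with the smallest validation error E_m ...
m_star = int(np.argmin(val_errors))

# ... then restore D: retrain the chosen model on ALL N points to get g_{m*}.
g_final = fit_poly(X, y, degrees[m_star])
```

Note that `g_final`, not the finalist `g_{m*}` minus, is what gets reported, exactly as in the regime above.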
Now, let's look at the bias.
I'm going to run an experiment to show you the bias.
So let me put it here and just build towards it.
What is the bias now?
We know we selected a particular model, and we selected
it based on D_val.
That's the killer.
When you use the estimate to choose, the estimate is no longer
reliable, because you particularly chose it, so now it looks
optimistic.
Because by choice, it has a good performance, not because it has
an inherently good performance. Because you looked for the one with
the good performance.
So the expected value of this fellow is now a biased estimate of the
ultimate quantity we want, which is the out-of-sample error.
So E_val, the sample thing, is biased from that.
And we would like to evaluate that.
Here is the illustration on the curve, and I'm going to ask you
a question about it, so you have to pay attention in order to be able to
answer the question.
Here is the experiment.
I have a very simple situation.
I have only two models to choose between.
One of them is 2nd-order polynomials, and the other one is
5th-order polynomials.
I'm generating a bunch of problems, and in each of them, I make a choice
based on validation set.
And after that, I look at the actual out-of-sample error.
And I'm trying to find out whether there is a systematic bias in the one
I choose, with respect to its out-of-sample error.
So it's not fixed which one I chose--
I'm not saying that I always chose H_2 or H_5.
In each run, I may choose H_2 sometimes and H_5 sometimes, whichever gave me
the smaller E_val.
And I'm taking the average over an incredible number of runs.
That's why you have a smooth curve.
So this will give me an indication of the typical bias you get when you make
a choice between two models, under the circumstances of the experiment.
Now the experiment is done carefully, with few examples.
The total is 30-some examples.
And I'm taking a validation set, which is 5 examples, 15
examples, up to 25 examples.
So at this point really, the number of examples left for
training is very small.
And I'm plotting this, so this is what I get for the average over the runs, of
the validation error on the model I chose-- the final hypothesis of the
model I chose.
And this is the out-of-sample error of that guy.
Now, I'd like to ask you two questions.
Think about them, and also for the online audience,
please think about them.
First question--
why are the curves going up?
This is K, the size of the validation set.
I'm evaluating it.
It's not because I'm evaluating on more points that the
curves are going up.
It's because when I use more for validation, I'm inherently using less
for training.
So there's an N minus K that is going the other direction.
And what we are seeing here really is the learning curve, backwards.
This is E_out.
I have more and more examples to train as I go here, so the out-of-sample
error goes down.
So in the other direction, it goes up.
And this, being an estimate for it, goes up with it.
So that makes sense.
Second question--
why are the two curves getting closer together?
Whether they're going up or down, that's not my concern at this point,
just the fact that they are converging to each other.
Now, that has to do with K proper, directly.
The other had to do with K indirectly, because I'm left with N minus K.
But now, when I have bigger K, the estimate is more and more reliable,
and therefore I get closer to what I'm estimating.
So we understand this.
This is definitely the evidence,
and in every situation you will have, there will be a bias.
How much bias depends on a number of factors, but the bias is there.
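This optimistic bias is easy to demonstrate numerically. Here is a small simulation under assumed conditions (a noisy quadratic target, squared error, and a large fresh sample standing in for E_out; none of these details are the lecture's actual experimental setup): choose between H_2 and H_5 by validation error on K = 5 points, over many runs, and compare the average selected validation error to the average true out-of-sample error of the selected model.

```python
import numpy as np

rng = np.random.default_rng(1)
runs, N, K, sigma = 500, 30, 5, 0.3
val_sel, out_sel = [], []

for _ in range(runs):
    X = rng.uniform(-1, 1, N)
    y = X**2 + sigma * rng.standard_normal(N)          # noisy quadratic target
    X_v, y_v = X[:K], y[:K]                            # validation set (K points)
    X_tr, y_tr = X[K:], y[K:]                          # training set (N - K points)
    # Large fresh sample used as a stand-in for the true E_out.
    X_te = rng.uniform(-1, 1, 1000)
    y_te = X_te**2 + sigma * rng.standard_normal(1000)

    best_e, best_w = None, None
    for d in (2, 5):                                   # choose between H_2 and H_5
        w = np.polyfit(X_tr, y_tr, d)
        e = np.mean((np.polyval(w, X_v) - y_v) ** 2)   # E_val of g_d minus
        if best_e is None or e < best_e:
            best_e, best_w = e, w
    val_sel.append(best_e)
    out_sel.append(np.mean((np.polyval(best_w, X_te) - y_te) ** 2))

# The selected validation error is optimistic: on average it
# underestimates the out-of-sample error of the chosen model.
bias = float(np.mean(out_sel) - np.mean(val_sel))
```

With a small K, the gap is clearly positive: picking the minimum of noisy estimates makes the winning estimate look better than it really is.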
Let's try to find, analytically, a guideline for the type of bias.
Why is that?
Because I'm using the validation set to estimate the out-of-sample error,
and I'm really claiming that it's close to the out-of-sample error.
And we realize that, if I don't use it too much, I'll be OK.
But what is too much?
I want to be a little bit quantitative about it, at least as a guideline.
So I have M models, and you can see that the M is in red.
That should remind you when we had M in red very early in the
course, because M used to make things worse.
It was the number of hypotheses, when we were talking about generalization.
And it was really that, when you have bigger M, you are in bigger trouble.
So it seems like we are also going to be in bigger trouble here, but the
manifestation is different.
We have now M models we are choosing from-- models
in the general sense.
This could be M values of the regularization parameter lambda in
a fixed situation, but we are still making one of M choices.
Now, the way to look at it is to think that the validation set is actually
used for training, but training on a very special hypothesis set, the
hypothesis set of the finalists.
What does that mean?
So I have H_1 up to H_M.
I'm going to run a full training algorithm on each of them, in order to
find a final hypothesis from each, using D_train.
Now, after I'm done, I am only left with the finalists,
g_1 up to g_M, with a minus sign because they are trained on the
reduced set.
So the hypothesis set that I am training on now is just those guys.
As far as the validation set is concerned, it didn't know what
happened before.
It doesn't relate to D_train.
All you did, you gave it this hypothesis set, which is the final
hypotheses from your previous guy, and you are asking it to choose.
And what are you going to choose?
You are going to choose the minimum error.
Well, that is simply training.
If I just told you that this is your hypothesis set, and that D_val is your
training, what would you do?
You will look for the hypothesis with the smallest error.
That's what you are doing here.
So we can think of it now as if we are actually training on this set.
And this tells us, oh, we need to estimate the discrepancy or the bias
between this and that.
Now it's between the validation error and the out-of-sample error.
But the validation error is really the training error on this special set.
So we can go back to our good old Hoeffding and VC, and say that the
out-of-sample error,
in this case, given from those-- and now you can see that the
choice here is star.
So I'm actually choosing one of those guys.
This is my training, and the final, final hypothesis is this guy-- is less
than or equal to the out-of-sample error, plus a penalty for the model
complexity.
And the penalty, if you use even the simple union bound,
will have that form.
You still have the 1 over square root of K, so you can always make it
better by having more examples.
But then you have a contribution because of the number of guys you are
choosing from.
If you are choosing between 10 guys, that's one thing.
If you are choosing between 100 guys, that's another.
It's worse.
Well, benignly worse, because it's logarithmic, but
nonetheless, worse.
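The penalty itself is easy to write down. Here is a sketch of the simple union-bound version, with Hoeffding applied to the M finalists; the confidence parameter delta and the exact constants are my assumptions, since the lecture only gives the 1-over-square-root-of-K and log-M behavior.

```python
import math

def validation_penalty(M, K, delta=0.05):
    # Hoeffding + union bound over M finalist hypotheses: with
    # probability >= 1 - delta, for errors in [0, 1],
    #   E_out(g_m*) <= E_val(g_m*) + sqrt(ln(2M / delta) / (2K)).
    return math.sqrt(math.log(2 * M / delta) / (2 * K))

p10 = validation_penalty(M=10, K=100)      # choosing among 10 finalists
p100 = validation_penalty(M=100, K=100)    # among 100: worse, but only logarithmically
p10_big = validation_penalty(M=10, K=400)  # 4x the validation points halves the penalty
```

The penalty grows only logarithmically in M (benignly worse), and shrinks like 1 over the square root of K.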
And if you are choosing between an infinite number of guys, we know
better than to dismiss the case off-hand.
You say, infinite number of guys, we can't do that.
No, no, no. Because once you go to the infinite choices,
you don't count anymore.
You go for a VC dimension of what you are doing.
That's what the effective complexity goes with.
And indeed, if you're looking for choice of one parameter, let's say I'm picking the
regularization parameter.
When you are actually picking the regularization parameter, and you
haven't put a grid-- you don't say, I'm choosing between 1, 0.1,
and 0.01, et cetera--
a finite number.
I'm actually choosing the numerical value of lambda, whatever it would be.
So I could end up with lambda equal 0.127543.
You are making a choice between an infinite number of guys, but you don't
look at it as an infinite number of guys.
You look at it as a single parameter.
And we know a single parameter goes with a VC dimension 1.
That doesn't faze us.
We dealt with VC dimensions much bigger than that.
And we know that if we have one parameter, or maybe two parameters,
and the VC dimension maybe is 2, if you have a decent set--
in this case decent K, not decent N, because that's the size of the set you
are talking about--
then your estimate will not be that far from E_out.
This is the idea.
So now you can apply this with the VC
analysis. Instead of just going for the number, which is the union bound,
you go for the VC version, and now apply it to this fellow.
And you can ask yourself, if I have a regularization
parameter, what do I need?
Or if I have another thing, which is the early stopping.
What is early stopping?
I'm choosing how many epochs to run.
Epochs is integer, but there is a continuity to it, so I'm
choosing where to stop.
All of those choices, where one parameter is being chosen one way or
the other, correspond to one degree of freedom.
So if I tell you the rule of thumb is that, when you are using the
validation set, if it's a reasonable-size set, let's say 100 points, and you
use those 100 points to choose a couple of parameters, you are OK.
You already can relate to that.
You don't need me to tell you that.
Because, 100 points, VC dimension 2,
yeah, I can get something.
Now, if I give you the 100 points and tell you you are choosing 20
parameters, you immediately say, this is crazy.
Your estimate will be completely ruined, because you are now
contaminating the thing.
This is now genuinely training, because the choice of the value of
a parameter is what?
Well, that's what training did.
The training of a neural network tried to choose the weights of the network,
the parameters.
There were just so many of them, that we called it training.
Now, when it's only one parameter or two, we call it choice of a parameter by
validation.
So it's a gray area.
If you push your luck in that direction, the validation estimate
will lose its main attraction, which is the fact that it's a reasonable
estimate of the out-of-sample, that we can rely on.
The reliability goes down.
So there is this tradeoff.
So with the data contamination, let me summarize it as follows.
We have error estimates.
We have seen some of them.
We looked at the in-sample error, the out-of-sample error, or E_test, and
then we have E_val, the validation error.
I'd like to describe those as data contamination, that if you use the
data to make choices, you are contaminating it as far as its ability
to estimate the real performance.
That's the idea.
So you can look at what the contamination is.
It's the built-in optimistic bias-- better described as deceptive,
because it's bad--
you are going to go to the bank and tell them, I can
forecast the stock market.
No, you can't.
So that's bad.
You were optimistic before you went there.
After that, you are in trouble.
You are trying to get the bias in estimating E_out, and you are trying to
measure what is the level of contamination.
So let's look at the three sets we used.
We have the training set.
This is just totally contaminated.
Forget it.
We took a neural network with 70 parameters, and we did backpropagation, and
we went back and forth, and we ended up with something, and we have a great
E_in, and we know that E_in is no indication of E_out.
This has been contaminated to death.
So you cannot really rely on E_in, as an estimate for E_out.
When you go to the test set, this is totally clean.
It wasn't used in any decisions.
It will give you an estimate.
The estimate is unbiased.
When you give that as your estimate, your customer is as likely to be
pleasantly surprised, as unpleasantly surprised.
And if your test set is big, they are likely not to be surprised at all.
It'll be very close to your estimate.
So there is no bias there.
Now, the validation set is in between.
It's slightly contaminated, because it made a few choices.
And the wisdom here,
please keep it slightly contaminated.
Don't get carried away.
Sometimes when you are in the middle of a big problem, with lots of
data, you choose this parameter.
Then, oh, there's another parameter I want to choose, so you use the same
validation set--
alarm bells, alarm bells-- and you keep doing it.
So you should have a regime to begin with, that you should have not only one
validation set.
You could have a number of them, such that when one of them gets dirty,
contaminated, you move on to the other one which hasn't been used for
decisions, and therefore the estimates will be reliable.
Now we go to cross-validation.
Very sweet regime, and it has to do with the dilemma about K. So now we're
not talking about biased versus unbiased, because this is
already behind us.
Now we're looking at an estimate and a variation of the
estimate, as we did before.
And we have the discipline to make sure that we don't mess it up by
making it biased.
So that is taken for granted.
Now I'm just looking at a regime of validation as we described it, versus
another regime which will get us a better estimate, in terms of the error
bar, the fluctuation around the estimate we want.
So we had the following chain of reasoning.
E_out of g, the hypothesis we are actually going to report, is what we
would like to know.
If we know that, we are set.
We don't have that, but that is approximately the same
as E_out of g minus.
This is the out-of-sample error, the proper out-of-sample error, but on the
hypothesis that was trained on a reduced set.
Correct?
And if I didn't take too many examples, they are
close to each other.
This one happens to be close to the validation estimate of it.
So here, it is because it's a different set that I'm training on.
Here, it's because I am making a finite-sample
estimate of the quantity.
Here, I could go up and down from this.
I'm looking at this chain. This is really what I want, and this
is what I'm working with.
This is unknown to me.
In order to get from here to here, I need the following.
I need K to be small, so that g minus is fairly close to g.
And therefore I can claim that their out-of-sample error is close, because
the bigger K is, the bigger the discrepancy between the training set
and the full set, and therefore the bigger the discrepancy between the
hypothesis I get here, and the hypothesis I get here.
So I'd like K to be small.
But also, I'd like K to be large, because the bigger K is, the more
reliable this estimate is, for that.
So I want K to have two conditions.
It has to be small, and it has to be large.
We will achieve both.
You'll see in a moment.
New mathematics is going to be introduced!
So here is the dilemma.
Can we have K to be both small and large?
The method looks like complete cheating, when you look at it first,
and then you realize, this is actually valid.
So what do we do?
I'm going to describe one form of cross-validation, which is the simplest
to describe, which is called "leave one out". Other methods will be
"leave more out", that's all.
But let's focus on "leave one out".
Here is the idea.
You give me a data set of N. I am going to use N minus
1 of them for training.
That's good, because now I am very close to N, so the hypothesis g minus
will be awfully close to g.
That's great, wonderful, except for one problem.
You have one point to validate on.
Your estimate will be completely laughable, right?
Not so fast.
In terms of a notation, I'm going to create a reduced data set from
D, call it D_n, because I'm actually going to repeat this
exercise for different indices n.
What do I do?
I take the full data set, and then take one of the points, that happens to
be n, and take it out.
This will be the one that is used for validation.
And the rest of the guys are going to be used for training.
Nothing different, except that it's a very small validation set.
That's what is different.
Now the final hypothesis, that we learn from this particular set, we have
to call g minus because it's not on the full set.
But now, because it depends on which guy we left out, we give it the label
of the guy we left out.
So we know that this one is trained on all the examples but n.
Let's look at the validation error, which has to be one point.
This would be what?
This would be E validation-- big symbol of this and that--
but in reality, the validation set is one point, so this is simply just the
error on the point I left out.
g_n minus did not involve the n-th example.
It was taken out.
And now that we froze it, we are going to evaluate it on that example,
so that example is indeed out-of-sample for it.
So I get this fellow.
Now, I know that this guy is an unbiased estimate, and I know that
it's a crummy estimate.
That much, I know.
Now, here is the idea.
What happens if I repeat this exercise for different n?
So I generate D_1, do all of this, and end up with this estimate.
Do D_2, all of this-- end up with another estimate.
Each estimate is out-of-sample with respect to the hypothesis that it's
used to evaluate.
Now, the hypotheses are different.
So I'm not really getting the performance of a particular hypothesis.
For this hypothesis, this is the estimate.
It's off.
For this hypothesis, this is the estimate.
It's off.
For this hypothesis, this is the estimate.
The common thread between all the hypotheses is that they are hypotheses
that were obtained by training on N minus 1 data points.
That is common between all of them.
It's different N minus 1 data points, but nonetheless,
it's N minus 1.
Because of the learning curve, I know there is a tendency.
If I told you this is the number of examples, you can tell me what is the
expected out-of-sample error.
So in spite of the fact that these are different hypotheses, the fact that
they come from the same number of points,
N minus 1,
tells me that they are all realizations of something that is the
expected value of all of them.
So the small errors estimate the error on these guys,
and these guys estimate the error of the expected value on N minus 1
examples, regardless of the identity of the examples.
So there is something common between these guys.
They are trying to estimate something.
So now what I'm going to do, I am going to define the cross-validation
error to be--
E cross-validation, E_cv,
to be the average of those guys.
It's a funny situation now.
These came from N full training sessions, each of them
followed by a single evaluation on a point, and I get a number.
And after I'm done with all of this, I take these numbers and average them.
Now, if you think of it as a validation set, now all of a sudden
the validation set is very respectable.
It has N points.
Never mind the fact that each of them is evaluated on a different
hypothesis.
I was able to use N minus 1 points to train, and that will give me
something very close to what happens with N. And I'm
using N points to validate.
The catch, obviously--
these are not independent, because the examples were used to create the
hypotheses, and some example was used to evaluate them.
And you will see that each of them is affected by the other, because the
hypothesis either has the point you left out, or you are evaluating on that.
Let's say, e_1 and e_3.
e_1 was used to evaluate the error on a hypothesis that involved the third
example, because the third example was in, when I talk about e_1.
Then e_3 was used to evaluate on the third example, but on a hypothesis
that involved the first example.
So you can see where the correlation is.
Surprisingly, the effective number, if you use this, is very close to N. It's
as if they were independent.
If you do the variance analysis, then out of 100 examples,
it's probably as if you were using 95 independent examples.
So it's remarkably efficient, in terms of getting that.
So this is the algorithm.
Now, let's illustrate it.
If you understand this, you understand cross-validation.
I'm illustrating it for the "leave one out".
I have a case.
I am trying to estimate a function.
I actually generated this function using a particular target.
I'm not going to tell you yet what it is. Added some noise.
And I am trying to use cross-validation, in order to choose a model,
or to just evaluate the out-of-sample error.
So let's evaluate the out-of-sample error using the cross-validation
method, for a linear model.
So what do you do?
First order of business, take a point that you will leave out. Right?
So now, this guy is the training set, and this guy is the validation set.
It's one point.
Then you train.
And you get a good fit.
Then, you evaluate the validation error on the point you left out.
That will be that.
That's one session.
We are going to repeat this three times, because we have three points.
So this is the second time we do it.
This time, this point was left out.
These guys were the training.
I connected them and computed the error.
Third one. You can see the pattern.
After I am done, I'm going to compute the cross-validation error to
be simply the average of the three errors.
So let's say we are using squared errors.
e_1 is the squared of this distance, et cetera, and you are
adding them up, one third.
This will be the cross-validation error.
What I am saying now is that you are going to take this as
an indication for how well the linear model fits the data, out-of-sample.
If you look in-sample, obviously it fits the data perfectly.
And if you use the three points, the line will be something like that.
It will fit it pretty decently.
But you have no way to tell how you are going to perform out-of-sample.
Here, we created a mini out-of-sample, in each case, and we took the average
performance of those as an indication of what will happen out-of-sample.
Mind you, we are using only 2 points here.
And when we are done, we are going to use it on 3 points.
That's g minus versus g.
It's a little bit dramatic here, because 2 and 3--
the difference is 1, but the ratio is huge.
But think of 99 versus 100.
Who cares?
It's close enough.
This is just for illustration.
So let's use this for model selection.
We did the linear model, and we call it linear.
So now let's go for the usual suspect, the constant model, exactly with the
same data set.
Let's look at the first guy.
These are the two points left out, the two points left out, and this is the
one for validation.
You train on those.
Here, you connected the points. Here,
you take the middle number, the average--
it's a constant.
And this would be your error here. Right?
Second guy, you get the idea? Third guy.
Now, if your question is: is the linear model better
than the constant model in this case?
then the only thing you look at in all of this is the cross-validation error.
So this guy, this guy, this guy, averaged, is the grade--
negative grade, because it's error-- for the linear model.
This guy, this guy, this guy, averaged, is that grade for the constant model.
And as you see, the constant model wins.
And it's a matter of record that these three points were actually generated
by a constant model.
Of course, they could have been generated by anything.
But on average, they will give you the correct decision.
And they avoid a lot of funny heuristics that you can apply.
You can say-- wait a minute, linear model, OK.
Any two points I pick, the slope here is positive.
So there is a very strong indication that there is a positive slope
involved, and maybe it's a linear model with a positive slope.
Don't go there.
You can fool yourself into any pattern you want.
Go about it in a systematic way.
This is a quantity we know, the cross-validation error.
This is the way to compute it.
We are going to take it as the indication, notwithstanding that there
is an error bar because it's a small sample, in this case 3,
and also because we are making the decision for 2 points, and we are
using it for 3 points.
These are obviously inherent, but at least it gives you something
systematic.
And indeed, it gives you the correct choice in this case.
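The whole three-point comparison can be reproduced in a few lines. This is a sketch with made-up coordinates (the lecture's actual points are not given numerically), using squared error; degree 0 is the constant model and degree 1 is the linear model.

```python
import numpy as np

def loocv_error(X, y, degree):
    # "Leave one out" cross-validation error for a polynomial model:
    # for each n, train on D_n (all points but n), then evaluate e_n
    # on the single point left out, and average.
    errs = []
    for n in range(len(X)):
        mask = np.arange(len(X)) != n
        w = np.polyfit(X[mask], y[mask], degree)
        errs.append((np.polyval(w, X[n]) - y[n]) ** 2)
    return float(np.mean(errs))

# Three illustrative points, roughly a constant target plus noise
# (these coordinates are assumptions, not the lecture's data).
X = np.array([-1.0, 0.0, 1.0])
y = np.array([0.6, 0.1, 0.9])

E_cv_linear = loocv_error(X, y, degree=1)   # linear model
E_cv_const = loocv_error(X, y, degree=0)    # constant model
```

With these points, the constant model gets the smaller cross-validation error and wins the comparison, matching the lecture's conclusion.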
So let's look at cross-validation in action.
I'm going to go with a familiar case.
You remember this one?
Oh, these were the handwritten digits, and we
extracted two features,
symmetry and intensity.
And we are plotting the different guys, and we would like to find
a separating surface.
We are going to use a nonlinear transform, as we always do.
And in this case, what I'm going to do, I'm going to sample 500 points
from this set at random for training, and use the rest for testing the
hypothesis.
What is the nonlinear transformation?
It's huge--
5th order.
So I am going to take all 20 features, or 21
including the constant.
And what am I going to use validation for?
This is the interesting part.
What I'm going to use validation for is, where do I cut off?
So I'm comparing 20 models.
The first model is, just take this guy.
Second model is, take x_1 and x_2.
Third model, take x_1, x_2, and x_1 squared, et cetera.
Each of them is a model.
I can definitely train on it and see what happens.
And I'm going to use cross-validation "leave one out", in order to choose
where to stop.
So if I have 500 examples, realize that every time I do this, I have to
have 500 training sessions.
Each training session has 499 points.
It's quite an elaborate thing.
But when you do this, this is the curve you get.
You get different errors.
Let me magnify it.
This is the number of features used.
This is the cutoff I talked about.
You can go all the way up to 20 features.
When you look at the training error, not surprisingly, the training error
always goes down.
What else is new?
You have more, you fit better.
The out-of-sample error, which I'm evaluating on the points that were not
involved at all in this process, cross-validation or otherwise, just out of
sample totally, I get this fellow.
And the cross-validation error, which I get from the 500 examples by
excluding one point at a time and taking the average, is remarkably
similar to E_out.
It tracks it very nicely.
And if I use it as a criterion for model choice, the minima are here.
So if I take between 5 and 7, let's say I take 6.
I would say, let me cut off at 6 and see what the performance is like.
Let's look at the result of that,
without validation, and with validation.
Without validation, I'm using the full model, all 20.
And you can see, we have seen this before-- overfitting.
I'm sweating bullets to include this single point in the middle, and after
I included it, guess what?
None of the out-of-sample points was red here.
This was just an anomaly.
So I didn't get anything for it. This is a typical thing.
It's unregularized.
Now, when you use the validation, and you stop at the 6th because the
cross-validation error told you so, it's a nice, smooth surface.
It's not perfect error, but it didn't put an effort where it didn't belong.
And when you look at the bottom line, what is the in-sample error here?
0%.
You got it perfect.
We know that.
And the out-of-sample sample error?
2.5%.
For digits, that's OK.
OK, but not great.
Here, we went.
And now the in-sample error is 0.8%.
But we know better.
We don't care about the in-sample error going to 0.
That's actually harmful in some cases.
The out-of-sample error is 1.5%.
Now, if you are in the range-- 2.5% means that you are
performing 97.5%.
Here, you are performing 98.5%.
40% improvement in that range is a lot.
There is a limit here that you cannot exceed.
So here, you are really doing great by just doing that simple thing.
Now you can see why validation is considered, in this context, as
similar to regularization.
It does the same thing.
It prevented overfitting, but it prevented overfitting by estimating
the out-of-sample error, rather than estimating something else.
Now, let me go and very quickly--
and I will close the lecture with it--
give you the more general form.
We talked about "leave one out". Seldom you use "leave one out"
in real problems, and you can think of why.
Because if I give you 100,000 data points, and you want to leave one out,
you are going to have 100,000 sessions training on 99,999 for each, and you
will be an old person before the results are out.
So when you have "leave one out", you have N training sessions
using N minus 1 points each, right?
Now, let's consider to take more points for validation.
1 point makes it great, because N minus 1 is so close to N,
that my g minus will be so close to g.
But hey, 100,000, if you decided to take 100,000 minus 1,000,
that's still 99,000.
That's fairly close to 100,000.
You don't have to make it difference 1.
So what you do is, you take your data set, and you break it into
a number of folds.
Let's say 10-fold.
So this will be 10-fold cross-validation.
And each time, you take one of the guys here, that is, 1/10 in this
case, use it for validation, and the 9/10, you use them for training.
And you change, from one run to another, which one you take for validation.
So "leave one out" is exactly the same, except that here, the 10,
replace it by N. I break the thing into 1 example at a time, and then I
validate on 1 example.
Here, I'm taking a chunk.
And therefore, you have fewer training sessions,
in this case 10 training sessions,
with not that much of a difference, in terms of the number of examples.
If N is big, instead of taking 1, you take a few more.
Now, the reason I introduced this is because this is what I actually
recommend to you.
Very specifically, 10-fold cross-validation works
very nicely in practice.
So the rule is, you take the total number of examples, divide them by 10,
and that is the size of your validation set.
You repeat it 10 times, and you get an estimate, and you are ready to go.
That's it.
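As a sketch of that recommended regime, here is generic 10-fold cross-validation on synthetic data. The polynomial model, target function, and data sizes are assumptions for illustration, not the digits experiment from the lecture.

```python
import numpy as np

def kfold_cv_error(X, y, degree, folds=10, seed=0):
    # K-fold cross-validation: break the data into `folds` chunks,
    # validate on each chunk in turn while training on the rest,
    # and average the validation errors over the folds.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # random assignment to folds
    chunks = np.array_split(idx, folds)
    errs = []
    for k in range(folds):
        val = chunks[k]                      # 1/10 of the data for validation
        train = np.concatenate([chunks[j] for j in range(folds) if j != k])
        w = np.polyfit(X[train], y[train], degree)
        errs.append(np.mean((np.polyval(w, X[val]) - y[val]) ** 2))
    return float(np.mean(errs))              # average over the 10 sessions

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, 100)
y = np.sin(np.pi * X) + 0.2 * rng.standard_normal(100)
E_cv = kfold_cv_error(X, y, degree=3, folds=10)
```

With 100 points and 10 folds, each training session uses 90 points, so each g minus stays close to the final g, while all 100 points contribute to the estimate.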
I will stop here, and we'll take questions after a short break.
Let's start the Q&A. And we have an in-house question.
STUDENT: You told about validation, and you told that we
should restrict ourselves in amount of parameters we should estimate.
Do we have a rule of thumb about the number of these parameters?
So is, say, K over 20 parameters reasonable for the maximum number?
PROFESSOR: It obviously depends on the number of data points.
So the reason why I didn't give a rule of thumb in this case, because it goes
with the number of points.
But let's say that if I have 100 points for validation, so it's a small
data set, I would say that a couple of parameters would be fine.
At least, that's my own experience.
And you can afford more, when you have more.
And when you have more, you can even afford more than one validation set,
in which case, you use each of them for a different estimate.
But the simplest thing, I would say, a couple of parameters for 100 points
would be OK.
MODERATOR: Can you clarify why model choice by validation doesn't count as
data snooping?
PROFESSOR: For the same reason that the answer is usually given for
a question like that, because it is accounted for.
I took the validation set-- the validation points are patently out of
sample-- and I used them to make a choice.
And when I made that choice, I made sure that the discrepancy between
in-sample and out-of-sample on the validation set is very small.
So we had this discussion of how much bias there is, and we want to make
sure that the discrepancy is very little.
So because I have already done the accounting, I can take it as
a reasonable estimate for the out-of-sample.
That is why.
In the other case, the problem with the data snooping that I gave is that
you use the data in order to make choices, and in
that case, huge choices.
You looked at the data and you chose between different models, and you
didn't pay for it.
You didn't account for it.
That's where the problem was.
MODERATOR: Some people recommend using cross-validation 10 times.
What does that add?
PROFESSOR: The regime I described, I only need to tell you
10-fold, 12-fold, 50-fold, and then the rest is fixed.
So if I use 10-fold, then by definition I'm going
to do this 10 times.
It's not a choice, given the regime that I described. In each
run, I am choosing one of the 10 to be my validation, and the rest for
training, and taking the average.
So the question is asking, do I do this 10 times?
Inherently, built in the method is that you use it 10 times,
if that's the question.
MODERATOR: I think the question is, since you chose your 10 subsets
inside and then ran cross-validation,
what if you do it again, choosing 10 different subsets, and repeat that process?
PROFESSOR: There are variations.
For example, even, let's say, with the "leave one out", maybe I can take
a point at random, and not necessarily insist on going through all the
examples-- do it like 50 times, and take the average.
Or I can take subsets, like in the 10-fold, but I take random subsets
and stop at some point.
So there are variations of those.
The ones I described are the most standard ones.
But there are obviously variations.
And one can do an analysis for them as well.
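One of the variations just mentioned-- taking random subsets rather than fixed folds, and stopping after some number of repeats-- is commonly called repeated random subsampling, or Monte Carlo cross-validation. A minimal sketch; the function name and the parameters shown are my own choices, not part of the lecture:

```python
import random

def monte_carlo_cv(n, val_size, repeats, seed=0):
    """Repeated random subsampling: each repeat draws a fresh random
    validation set of val_size points instead of cycling through
    fixed folds, and you stop after a chosen number of repeats."""
    rng = random.Random(seed)
    for _ in range(repeats):
        val = rng.sample(range(n), val_size)
        chosen = set(val)
        train = [i for i in range(n) if i not in chosen]
        yield train, val

# For example: 50 repeats, validating on 10 of 100 points each time.
runs = list(monte_carlo_cv(100, 10, 50))
```

Unlike k-fold, an example may land in several validation sets or in none, so the averaged estimate has slightly different correlations among its terms.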
MODERATOR: Is there any rule for separating data among training,
validation, and test?
PROFESSOR: Random is the only trustworthy thing.
Because if you use your judgment somehow, you may introduce a sampling
bias, which we'll talk about in a later lecture.
And the best way to avoid that for sure, if you sort of flip coins to
choose your examples, then you know that you are safe.
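A coin-flip split of the kind just described can be done by shuffling the indices once and cutting. A sketch under my own assumptions-- the 60/20/20 fractions below are illustrative defaults, not a recommendation from the lecture:

```python
import random

def random_split(n, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle the indices once ("flip coins"), then cut into
    training, validation, and test sets; no human judgment enters,
    so the split itself introduces no sampling bias."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = random_split(100)
```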
MODERATOR: What's the computational complexity of adding
a cross-validation?
PROFESSOR: I didn't give the formula for it.
Basically, for "leave one out", you are doing N times as much
training as you did before.
The evaluation is trivial. Most of the time goes for the training.
So you can ask yourself, how many training sessions do I have to do now
that I'm using cross-validation, versus what I had to do before?
Before, you had to do one session.
Here, you have to do as many sessions as there are folds.
So 10-fold will be 10 times.
"Leave one out" would be N, because it's really N-fold, if
you want, and so on.
MODERATOR: A clarification--
can you use both regularization and cross-validation?
PROFESSOR: Absolutely.
In fact, one of the biggest utilities for validation is to choose
the regularization parameter.
So inherently in those cases, you do it.
You can use it to choose the regularization parameter.
And then you can also use it on the side, to do something else.
So both of them are active in the same problem.
And in most of the practical cases you will encounter, you will actually be
using both.
Very seldom can you get away without regularization, and very seldom can
you get away without validation.
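As an illustration of using cross-validation to choose the regularization parameter, here is a toy one-dimensional weight-decay (ridge) fit, where the regularized minimizer has a closed form. The data, the grid of lambda values, and all the names are my own; this is a sketch of the idea, not the lecture's implementation:

```python
import random

def ridge_1d(xs, ys, lam):
    """Closed-form 1-D weight-decay fit: minimize sum (y - w*x)^2 + lam*w^2,
    which gives w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=10):
    """10-fold cross-validation error of the ridge fit for a given lambda."""
    n = len(xs)
    idx = list(range(n))
    random.Random(0).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for i in range(k):
        val = set(folds[i])
        tr = [j for j in idx if j not in val]
        w = ridge_1d([xs[j] for j in tr], [ys[j] for j in tr], lam)
        total += sum((ys[j] - w * xs[j]) ** 2 for j in folds[i])
    return total / n

# Toy data: y = 2x + noise.  Pick the lambda with the smallest CV error.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]
grid = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: cv_error(xs, ys, lam))
```

The same machinery serves double duty: the cross-validation error selects lambda, and regularization then shapes the final fit-- the two techniques active in one problem, as described above.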
MODERATOR: Someone is asking that, this seems to be a brute force method for
model selection.
Is there a way to branch and bound how many hypotheses to consider?
PROFESSOR: There are lots of methods for model selection.
This is the only one, at least among the major ones, which does not require
assumptions.
I can do model selection based on, I know my target function is
symmetric, so I'm going to choose a symmetric model.
That can be considered model selection.
And there are a bunch of other logical methods to choose the model.
The great thing about validation is that there are no assumptions
whatsoever.
You have M models.
What are the models? What assumptions do they have?
How close are they, or not, to the target function--
Who cares?
They are M models.
I am going to take a validation set, and I'm going to find this objective
criterion, which is a validation or cross-validation error, and I'm going
to use it to choose.
So it's extremely simple to implement, and very immune to assumptions.
Obviously, if you make assumptions and you know that the assumptions are
valid, then you would be doing better than I am doing.
But then you know that the assumptions are valid.
I'm taking the case where I don't want to make assumptions that I don't know
hold, and still want to make the model selection.
MODERATOR: In the case where the data depends on time evolution,
how can validation update the model?
Is it used for that, or not?
PROFESSOR: Validation makes a principled choice, regardless of the
nature of that choice.
Let's say that I have a time series, and one of the things in time series--
let's say they're for financial forecasting--
is that, you can train, and then you get a system, and then the world
is not stationary.
So a system that used to work, doesn't work anymore.
You can make choices about, let's say I have a bunch of models, and I
want to know which one of them works at a particular time, given some
conditions.
You can make the model selection based on validation, and then you take
that model and apply it to the real data, or there are a bunch of things
you can do.
But in terms of tracking the evolution of systems, again, if you translate
the problem into making a choice, then you are ready to go with validation.
So the answer is yes.
And the method is to spell the problem out as a choice.
MODERATOR: Another clarification--
so with cross-validation, there's still some bias.
Can you quantify why it is better than just regular validation?
PROFESSOR: Both validation and cross-validation will have bias for
the same reasons.
The only question is the reliability of the estimate.
Let's say that I use "leave one out", so here's E_out.
And the bias aside, if I use "leave one out", I'm using all N of
the examples eventually, when I average them.
So the error bar is small.
Granted, it's not as small as it would be if the N errors were independent of
each other.
But it's fairly close to being as if they were independent.
So I get that estimate.
Therefore, anytime you have this estimate, it becomes less
vulnerable to bias.
Because if I have this small play, and I'm pulling down, I'm not going to pull
down too far, because I'm still within the error bar.
If I have the other estimate, which is swinging completely, it's very easy to
pull it down, and I get a worse effect of the bias.
So whenever you minimize the error bar, you minimize the vulnerability to
bias as well.
That's the only thing that cross-validation does.
It allows you to use a lot of examples to validate, while using a lot of
examples to train.
That's the key.
MODERATOR: Going back to the previous lecture, a question on that.
Can you see the augmented error as conceptually the same as a low-pass
filtered version of the initial error, or not?
PROFESSOR: It can be translated to that under the condition
that the regularizer is a smoothness regularizer, because that's what
low-pass filters do.
So as an intuition, it's not a bad thing to consider
in the case of something like weight decay. It's not going to be strictly
low-pass as in working in the Fourier domain and cutting off, et cetera.
But it will have the same effect of being smooth.
If you have a question,
please step to the microphone, and you can ask it.
So there's a question in house.
STUDENT: Yes.
It seems that cross-validation is a method to deal with limited size of
the data set.
So is it possible in practice that we have a data set so large that
cross-validation is not needed or not beneficial, or do people do it all the
time in principle?
PROFESSOR: It is possible, and one of the cases is the Netflix case,
where you had 100 million points.
So you think at this point, nobody will care about cross-validation.
But it turned out that even in this case, the 100 million points had only
a very small subset that came from the same distribution as the out-of-sample data.
So the 100 million--
again, it's the same question as the time evolution.
You have people making ratings, and different people making different
numbers of ratings, and this changes for a number of reasons.
Even for the same user,
after you rate for a while, you tend to change from your initial ratings.
Maybe you are initially excited or something.
So there are lots of considerations like that.
So eventually, the number of points that were patently coming from the
same distribution as the out-of-sample was much smaller than 100 million.
And these are the ones that were used to make big decisions,
like validation decisions.
And in that case, even if we started with 100 million, it might be a good
idea to use cross-validation at the end.
And if you use something like 10-fold cross-validation, then it's not that
big a deal, because you are just multiplying the effort by 10, which
is, given what is involved, not that big a deal.
And you really get a dividend in performance.
And if you insist on performance, then it becomes indicated.
So the answer is yes, because it doesn't cost that much, and because
sometimes in a big data set, the relevant part, or the most relevant
part, is smaller than the whole set.
MODERATOR: Say there's a scenario where you find your model through
cross-validation, and then you test the out-of-sample error.
But somehow you test a different model, and it gives you a smaller
out-of-sample error.
Should you still keep the one you found through cross-validation?
PROFESSOR: So I went through this learning and
came up with a model.
Someone else went through whatever exercise they have and came up with
a final hypothesis in this case.
And I am declaring mine the winner because of cross-validation, and now
we are saying that there's further statistical evidence.
We get an out-of-sample error that tells me that mine is not as good as
the other one.
Then it really is the question of, I have two samples, and I'm doing
an evaluation.
And one of them tells me something, and the other one tells me the other.
So I need to consider first the size of them. That will give me the
relative size of the error bar. And correlations, if any. And bias, which
cross-validation may have, whereas the other one, if it's truly out of
sample, does not.
If I go through the math, and maybe the math won't go through--
it's not always the case--
I will get an indication about which one I would favor.
But basically, it's purely a statistical question at this point.
MODERATOR: When there are few points, and cross-validation is going to be
done, is it a good idea to re-sample to enlarge the current
sample, or not really?
PROFESSOR: So I have a small data set.
That's the premise?
And I'm doing cross-validation.
So what is the--
MODERATOR: So the problem is, since you have few samples,
do you want to re-sample?
PROFESSOR: So instead of breaking them into chunks,
keep taking at random?
Well, I don't have from my experience something that would indicate that one
would win over the other.
And I suspect that if you are close to 10-fold, you probably are close to the
best performance you can get with variations of these methods.
And the problem is that all of these things are not completely pinned down
mathematically.
There is a heuristic part of it, because even cross-validation, we
don't know what the correlation is, et cetera.
So we cannot definitively answer the question of which one is better.
It's a question of trying in a number of problems, after getting the
theoretical guidelines, and then choosing something.
What is being reported here is that the 10-fold cross-validation stood the
test of time.
That's the statement.
MODERATOR: When there is a big class size imbalance, does cross-validation
become a problem?
PROFESSOR: When there is an imbalance between the classes-- that is,
a bunch of +1's and fewer -1's-- there are certain things that need
to be taken into consideration, in order to make learning go through
well-- in order to basically avoid the learning algorithm going for the
all +1 solution, because it's a very attractive one.
So there are a bunch of things that can be taken into consideration, and I
can see a possible role for cross-validation.
But it's not a strong component as far as I can see.
The question of balancing them, making sure that you avoid the all-constant
solution, and things like that, will probably play a bigger role.
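One concrete way cross-validation can be adapted to imbalanced classes-- not something prescribed in the lecture, just a common practice I'm adding for illustration-- is to stratify the folds so each validation chunk preserves the class proportions:

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Build k folds that preserve each class's proportion, so even a
    rare class shows up in every validation chunk."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)   # deal each class round-robin across folds
    return folds

# 90 examples of +1 and 10 of -1: each of the 10 folds gets one -1.
labels = [+1] * 90 + [-1] * 10
folds = stratified_folds(labels, k=10)
```

Without stratification, a purely random fold of a 90/10 data set can easily contain no minority examples at all, which makes the per-fold validation error misleading.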
MODERATOR: How does the bias behave when we increase the number of points
that we leave out?
That is, the size t, if we leave t points out.
PROFESSOR: The points we leave out are the validation points.
And if we are using the 10-fold or 12-fold, et cetera, the total number
that go into the summation will be constant, because in spite of the fact
that we're taking different numbers, we go through all of them,
and we add them up.
So that number doesn't change.
MODERATOR: So how does it change, if instead of doing
10-fold, you use 20-fold?
How does that--
PROFESSOR: How does it change?
It doesn't change the number of total points going into the estimate
of cross-validation.
But what was the original question?
MODERATOR: So how does the bias behave?
PROFESSOR: Oh.
Well, given that the total number will give you the error bar, and given that
the bias is really a function of how you use it, rather than something
inherent in the estimate, the error bar will give you an indication of how
vulnerable you are to bias.
Say that, if you take two scenarios where the error bar is comparable, you
have no reason to think that one of them will be more vulnerable to bias
or another.
Now, you would need a very detailed analysis to see the difference between
taking one point at a time, trained on N minus 1, considering the
correlations, versus taking 1/10 at a time and adding them up-- to find
out what the correlation is, what the effective number of examples is,
and therefore what the error bar is.
In any given situation, that would be a pretty heavy task to do.
So basically, the answer is that as long as you do a number of
folds, and every example appears in the cross-validation
estimate exactly once, then there is no preference between them as far as
the bias is concerned.
MODERATOR: I think that's it.
PROFESSOR: Very good.
We'll see you on Thursday.