ANNOUNCER: The following program is brought to you by Caltech.
YASER ABU-MOSTAFA: Welcome back.
Last time, we introduced the notion of overfitting.
The idea was that we are fitting the data all too well, at the expense of
the generalization out of sample.
We took a case where the target was simple, but we
added very little noise.
And that little noise was enough to misguide the fit, when we used a higher-order
polynomial, into producing a very poor approximation of the target we were trying to learn.
Overfitting, as a notion, is broader in scope than just bad generalization.
If you think of what the VC analysis told us: given the data resources and the
complexity of the hypothesis set, with nothing said about the target, we can
predict the level of generalization as a bound.
Now, overfitting relates to the target.
For example, in this case the target is noisy, and we have overfitting.
If the target was noiseless, if we had points coming from the blue curve and
we fit them, we would fit them perfectly, because it's a very simple
equation to solve for a polynomial.
And then we will get the blue curve exactly.
Since the VC analysis doesn't tackle the target function, you might be
curious: are we changing the game? What is the deal here?
The idea is that the VC doesn't tackle the target, not in the sense
that it doesn't know how to deal with it.
What it does is, it gets you a bound that is valid for all possible
targets, noisy or noiseless.
And therefore, it allows the notion of overfitting.
It gives you a bar for how bad the generalization can be.
And within that bar, the actual generalization could be good, or it could be bad.
Furthermore, it could be that the in-sample error is going down while
out-of-sample error is going up, which is our definition of overfitting.
Or it could be that both of them are going down, but the generalization is
getting worse.
So it doesn't specify to us whether overfitting will happen or not.
Although it doesn't predict it, it allows it.
So now we are zooming in into the details of the theory, and trying to
characterize a situation that happens very often in practice, where the
noise in the target function results in overfitting.
And we can do something about it.
That's why we are actually studying it, because there will be ways to cure
that disease, if you will.
Then we identified the source of overfitting as fitting the noise.
And the conventional meaning of the noise we have is what we are referring
to now as stochastic noise, because we are introducing another type.
The idea is that if you fit the noise, this is bad, because you are fitting
something that cannot be fit.
And because you are fitting it, you are predicting or extrapolating
out-of-sample into a non-existing pattern.
And that non-existing pattern will take you away from the target function.
So it will contribute in a harmful way to your out-of-sample error.
But the novel notion was the fact that, even if we don't have stochastic
noise, even if the data is not noisy in the conventional sense, there is
something which we refer to as deterministic noise, which is
a function of the limitations of your model.
So here, your model is 4th-order polynomial.
Other models will give you different deterministic noise.
And it is defined as the difference between the target function, in this
case the blue wiggly curve, and the best approximation to that target within
your hypothesis set.
Again, it captures the notion of something that we cannot learn at all,
because it's outside of our ability as a hypothesis set.
Therefore, it behaves like a noise.
If we try to fit it, on a finite sample, and try to dedicate some resources to
it, whatever we are learning doesn't make sense, and it will lead to
a pattern that harms the out-of-sample error.
And indeed, we ran an extensive experiment where we compared the
deterministic noise, parameterized by the target complexity.
The more complex the target is, the more deterministic noise we have.
And we found that the behavior, the impact on overfitting, is fairly
similar to the behavior when we increase the stochastic noise in
a similar experiment.
Today, we are going to introduce the first cure for overfitting, which is
regularization.
And next time, we are going to introduce validation, which is the
other side of this.
Regularization is a technique that you will be using in almost every machine
learning application you will encounter.
So it's a very important technique, very important to understand.
There are many approaches to it.
So as an outline, I am going to first talk about it informally, and talk
about the different approaches.
Then I'm going to give a mathematical development of the most famous form of
regularization.
And from that, we are not only going to get the mathematical result, but we are
going to get very good intuition about the criteria for choosing
a regularizer, and we'll discuss it in some detail.
Then we will talk about the ups and downs of choosing a regularizer at the
end, which is the practical situation you will face.
You have a problem.
There is overfitting.
How do I choose my regularizer?
Let's start.
You will find two approaches to regularization in the literature,
each pursued as vigorously as the other.
One of them is mathematical, purely mathematical.
And it mostly comes from function approximation, where you have
an ill-posed problem.
You want to approximate the function, but there are many functions that
actually fit it, so the problem is ill-posed.
And then you impose smoothness constraints on it, in order to be able
to solve it.
This is a very well-developed area.
And it has been borrowed by machine learning.
And actually, the mathematical development I am going to give
relates to that body of work.
Also, in the Bayesian approach to learning, regularization is completely
mathematical.
You put it in the prior, and from then on you have a very well-defined
regularizer, in this case.
And in all of those cases, if the assumptions that you made in order to
carry out the development hold, then this is the way to go.
There is no reason to go for intuition and heuristics and the other stuff, if
you have a solid assumption and a solid mathematical derivation that
gets you the result.
The problem really is that, in most of the cases you will encounter, the
assumptions that are made here are not realistic.
Therefore, you end up with these approaches producing a very careful
derivation, based on assumptions that do not hold.
And it's a strange activity when this is the case.
If you are very rigorous, and trying to get a very specific mathematical
result when your main assumption is not going to hold in the application
you are going to use it in, then you are being penny wise, dollar foolish.
The best utility for the mathematical approach in practical machine learning
is to develop the mathematics in a specific case, and then try to
interpret the mathematical result in such a way that we get an intuition
that applies when the assumptions don't apply, very much like we did
with the VC analysis.
We don't compute the VC dimension and get the bound in every case, but
we got something out of the VC bound, which gives us the behavior of
generalization.
And from then on, we used it as a rule of thumb.
So here we are going to use something similar.
The other approach you will find is purely heuristic.
And in this case, you are just handicapping the minimization of the
in-sample error, which is putting the brakes, as we described it last time.
And indeed, this is what we are going to do.
We are going to borrow enough from the math, to make this not a completely
random activity, but rather pointed at something that is likely to help our
cause of fighting overfitting.
I am going to start with an example of regularization, and how it affects
overfitting.
And the example will be quite familiar.
You have seen it before.
You probably recognize this picture.
This is a sinusoid.
And we had the funny problem, where we had only two points in the training
set, so N equals 2.
And we were fitting our model, which was a general line in this case.
So we pass the line through the two points.
And we get a variety of lines, depending on the data set you have.
And we noticed, after doing a careful analysis of this using the
bias-variance analysis, that this is really bad.
And the main reason it's bad is because it's all over the place.
And being all over the place results in a high variance term.
That was the key.
In that case, a simplistic constant model, where you approximate the sine
by just a constant, which ends up being a zero on average,
is actually better in performance out-of-sample than
fitting with a line.
That was the lesson that we got there.
So let's see if we can improve the situation here by regularizing it, by
controlling the lines.
Instead of having wild lines, we are going to have mild lines, if you will.
What we're going to do, we are going to not let the lines be
whatever they want.
We are going to restrict them in terms of the offset they can have, and the
slope they can have.
That is how we are putting the brakes on the fit.
Obviously, we are sacrificing the perfect fit on the training set when
we do that.
But maybe we are going to gain.
Yet to be seen.
So this would be without regularization, using our new term.
And when you have it with regularization, and put the constraint
on the offset and the slope, these are the fits you are going to get on the
same data sets.
Now you can see that they are not as crazy as the lines here.
Each line tries to fit the two points.
It doesn't fit them perfectly, because it is under a constraint that prevents
it from passing through the points perfectly.
Nonetheless, it looks like the large variance we had before has
been diminished here.
But we don't have to judge it visually.
We can go to our standard quantities, the bias and variance, and
do a complete analysis here, and see which one wins.
So let's see who the winner is.
This is without regularization, versus with regularization.
We have seen without regularization before.
This was the case where, if you remember, this red guy is the average
line you get.
It is not a hypothesis that you are going to get in any
given learning scenario.
But it is the average of all the lines people get, when they get different
data sets of two points each.
And around that there is a great variance, depending on which two points you get,
described by this gray region as a standard deviation.
The width of the gray region is what killed us in that case, because the
variance is so big.
If you had an infinite number of data sets, each with two points, you would
get the red line, which is not bad at all.
But you don't get that.
You get only two points.
So sometimes you will be doing something like that, instead of this.
And on average, the out-of-sample error will be terrible.
Let's look at the situation with regularization.
As expected, the gray region has diminished, because the
lines weren't as crazy.
If you look at the red line, the red line is a little bit different,
because we couldn't fit the points exactly, so there's a little bit of
an added bias, because the fit is not perfect.
And we get this.
Now regularization, in general, reduces the variance at the expense,
possibly, of increasing the bias just a little bit.
So think of it that I am handicapping the fit.
Well, you are handicapping the fit on both the noise and the signal.
You cannot distinguish one from another.
But the handicapping of the noise is significant.
That's what reduced the variance.
The handicapping of the signal results in a certain loss in the quality of the
fit, and that is reflected in the added bias.
Let's look at the numbers and see that this actually holds up.
The bias here was 0.21.
We have seen these numbers before.
And the variance was a horrific 1.69.
And when we added them up, the linear model lost to the
simplistic constant model.
So let's look at with regularization.
Now we are using still the linear model, but we are regularizing it.
And you get a bias of 0.23.
Well, that's not too bad.
We lost a little bit.
Think of it as a side effect of the treatment.
You're attacking the disease, which is overfitting.
And you will get some funny side effects.
So instead of getting the 0.21, you are getting 0.23.
OK, fine.
How about the variance?
Totally dramatic, 0.33.
And when you add them up, not only do you win over the unregularized guy.
You also win over the constant model.
If you get the numbers for the constant model, this guy wins.
And that has a very interesting interpretation.
Because when you are trying to choose a model, you have the constant, and
then you have the linear, and then you have the quadratic.
This is sort of a discrete grid.
Maybe the best choice is actually in between these guys.
And you can look at regularization as a way of getting the
intermediate guys.
There is a continuous set of models that go from extremely restricted to
extremely unrestricted.
And therefore, you fill in the gap.
And by filling in the gap, you might find the sweet spot that gives you the
best out-of-sample error.
In this case, we don't know that this is the best out-of-sample error, for
the particular level of regularization that I did.
But it certainly beats the previous champion, which was
the constant model.
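To make these numbers concrete, here is a minimal Monte Carlo sketch of this experiment. The target sin(pi x), the uniform inputs on [-1, 1], and the value lambda = 0.1 are assumptions made for illustration; the exact amount of regularization behind the figures above isn't specified here.

    import numpy as np

    rng = np.random.default_rng(0)

    def target(x):
        return np.sin(np.pi * x)

    def fit_line(x, y, lam):
        # Least squares for h(x) = w0 + w1*x with weight decay penalty lam*(w0^2 + w1^2).
        Z = np.column_stack([np.ones_like(x), x])
        return np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ y)

    def bias_variance(lam, n_datasets=20000):
        x_test = np.linspace(-1, 1, 201)
        preds = np.empty((n_datasets, x_test.size))
        for d in range(n_datasets):
            x = rng.uniform(-1.0, 1.0, size=2)       # N = 2 points per data set
            w = fit_line(x, target(x), lam)
            preds[d] = w[0] + w[1] * x_test
        g_bar = preds.mean(axis=0)                    # the average hypothesis (the red line)
        bias = np.mean((g_bar - target(x_test)) ** 2)
        var = np.mean(preds.var(axis=0))
        return bias, var

    print(bias_variance(lam=0.0))    # roughly bias 0.21, variance 1.69 (no regularization)
    print(bias_variance(lam=0.1))    # the variance drops sharply for a small increase in bias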
Knowing this, we would like to understand what was the regularization,
in specific terms, that resulted in this.
And I'm going to present it in a formal setting.
And in this formal setting, I am going to give a full mathematical
development, until we get to the solution for this regularization,
which is the most famous regularization you will encounter in
machine learning.
My goal is not mathematics for the sake of mathematics.
My goal is to get to a concrete conclusion in this case, and then read
off that conclusion the lessons we can learn, in order to be able to deal
with situations that are not as ideal as this one, and indeed we will
succeed in that.
So let's look at the polynomial model.
We are going to use polynomials, as the expanding components.
And we are using Legendre polynomials, which I
alluded to last time.
There is nothing mysterious about them.
They are polynomials, as you can see.
And L_2 is of order two.
L_3 is of order three, and so on.
And the only thing is that, they are created such that they would be
orthogonal to each other, which would make the mathematics nice, and will
make it such that when we combine them using coefficients, the coefficients
can be treated as independent.
They deal with different coordinates that don't interfere with each other.
If we use just the monomials, the monomials are extremely correlated.
Therefore, the relevant parameter, as far as you're concerned, would be
a combination of the weights, rather than an individual weight.
So this saves you by getting the combinations ahead of time, so that the
weights actually are meaningful in their own right.
That's the purpose here.
So what is the model?
The model will be H_Q, which is, by definition, the polynomials of
order Q. And the nonlinear transformation that takes the scalar
variable x, and produces this polynomial, is given by
this vector, as usual.
You start with the mandatory 1, and then you have Legendre
of order 1 up to Legendre
of order Q.
When you combine these linearly, you are going to get a polynomial of order
Q, not a weird polynomial of order Q, just a regular polynomial of order Q,
just represented in this way.
If you expand and collect terms, there will be a coefficient for the constant,
a coefficient for x, a coefficient for x squared, up to x to the Q.
Using these polynomials, the formal parameterization of the hypothesis set
would be the following.
You take these guys, and give them weights.
And these weights are the parameters that will tell you one hypothesis
versus the other.
And you sum up over the range that you have.
And this will be the general hypothesis, in this hypothesis set.
Because it has that nice form, which is linear, we obviously are going to
apply the old-fashioned linear regression, in the Z space, in order to
find the solution.
It will be a very easy analytic solution because of this.
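As a concrete sketch of that transform (the function name and interface here are mine, not part of the lecture), the standard three-term recurrence generates the Legendre polynomials directly:

    import numpy as np

    def legendre_features(x, Q):
        # z = [1, L_1(x), ..., L_Q(x)] for each input x, using the recurrence
        # (k+1) L_{k+1}(x) = (2k+1) x L_k(x) - k L_{k-1}(x), with L_0(x) = 1 and L_1(x) = x.
        x = np.asarray(x, dtype=float)
        Z = np.ones((x.size, Q + 1))
        if Q >= 1:
            Z[:, 1] = x
        for k in range(1, Q):
            Z[:, k + 1] = ((2 * k + 1) * x * Z[:, k] - k * Z[:, k - 1]) / (k + 1)
        return Z     # an N-by-(Q+1) matrix whose n-th row is the vector z_n

Each row of the returned matrix is the vector z for one input, which is exactly the matrix Z used below.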
Let me just underline one thing.
I am talking here about the hypothesis set.
I am using the Legendre polynomials, and this model, in order to construct
the hypothesis set.
I didn't say a word about the target function.
The target function here is unknown.
And the reason I am saying that is because, last time in the experiment
for overfitting, I constructed the target function
using the same apparatus.
And I did it just because the overfitting depended on the target
function, and I wanted to pin it down.
But here, the target function goes back to the normal learning scenario.
The target function is unknown.
And I am using this as a parameterized hypothesis set, in order to get a good
approximation for the target function using a finite training set.
That's the deal.
Let's look at the unconstrained solution.
Let's say I don't have regularization.
This is my model.
What do you do?
We have seen this before.
I am just repeating it, because it's in the Z space, and in order to refresh
your memory.
So you are given the examples x_1 up to x_N with the labels, the labels being
real numbers, in this case.
And x_1 up to x_N, I'm writing them as a scalar, because they are.
And then I transform them into the Z space, so I get a full vector
corresponding to every x, which is the vector of the Legendre polynomials.
I evaluate it at the corresponding x, so I get this.
And my goal in learning is to minimize the in-sample error.
The in-sample error will be a function of the parameters, w.
And this is the formula for it. It's exactly the squared error formula
that we used for linear regression.
So you do this, which is the linear combination in the Z space.
Compare it to the target value, which is y_n.
The error measure is squared.
You sum up over all the examples and normalize by N. So, this is indeed the
in-sample error as we know it.
And we put it in vector form,
if you remember this one.
So all of a sudden, instead of z as vector, we have Z as a matrix.
And instead of y as a scalar, we have y as a vector.
So everybody got promoted.
The matrix Z is where every vector z is a row.
So you have a bunch of rows describing this.
It's a tall matrix if you have a big training set, which
is the typical situation.
And the vector y is the corresponding vector of the labels y.
When you put it in vector form, you have this equals that.
Very easy to verify.
And it allows us to do the operations in a matrix form, which is
much easier to do.
So we want to minimize that.
And the solution we are going to call w_lin, for linear
regression in this case.
And we have the form for it.
It's the one-step learning.
It happens to be the pseudo-inverse, now in the Z space,
so it has this form.
If I give you the x's, and you know the form for the Legendre polynomials,
you compute the z's.
You have the matrix.
You have the labels.
You plug it into the formula, and you have your solution.
So this is an open and shut case.
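In symbols, the quantities just described are, assuming Z transposed Z is invertible,

    E_{\rm in}(w) \;=\; \frac{1}{N}\sum_{n=1}^{N}\left(w^{\mathsf T}z_n - y_n\right)^2 \;=\; \frac{1}{N}\,(Zw - y)^{\mathsf T}(Zw - y),

    w_{\rm lin} \;=\; \arg\min_{w} E_{\rm in}(w) \;=\; \left(Z^{\mathsf T}Z\right)^{-1} Z^{\mathsf T} y.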
Let's look at the constrained version.
What happens if we constrain the weights?
Now come to think of it, we have already constrained the weights in one
of the applications.
We didn't say that we did, but that's what we effectively did.
Why is that?
We actually had a hard constraint on the weights, when we used
H_2 instead of H_10.
Remember, H_2 is a 2nd-order polynomial.
H_10 was a 10th-order polynomial.
Wait a minute.
These are two different hypothesis sets.
I thought the constraint was going into one hypothesis set, and then playing
around with the weights.
Yes, that's what it is.
But one way to view H_2 is, as if it was H_10, with a constraint.
And what would that constraint be?
Just set all the parameters to zero, above power 2.
That is a constraint.
But obviously that's an extreme case.
What we mean by a constraint, usually with regularization, is something
a little bit softer.
So here is the constraint we are going to work with.
We are going to work with a budget C for the total magnitude
squared of the weights.
Before we interpret this, let's first concede that this is indeed
a constraint. The hypotheses that satisfy this are a proper subset of
H_Q, because I have excluded the guys that happen to have weights
bigger than that.
Because of that, I am already ahead, using the VC analysis that
I have in my mind.
I have a smaller hypothesis set, so the VC dimension is going in the
direction of being smaller.
So I am standing a chance of better generalization.
This is good.
Now, interpreting this as something along the same lines here,
instead of setting some weights to 0, which is a little bit hard, you just
want them, in general, to be small.
You cannot have them all big.
So if you have some of them 0, that leaves more in the budget for you to
play around with the rest of the guys.
And because of this, where the earlier case was a hard-order constraint
(you say the order is 2, and anything above power 2 is zero),
here you can deal with it as a soft-order constraint.
I'm not really excluding any orders.
I'm just making it harder for you to play around with all the powers.
Now let's look at the problem, given that this is the constraint.
You are still minimizing the in-sample error.
But now you are not free to choose the w's here any which way you want.
You have to be satisfying the constraint.
So that minimization is subject to, and you put the constraint in vector
form, and this is what you have.
This is now the problem you have.
When you solve it, however you do that, we are going to call the
solution w_reg, for regularization, instead of w_lin, for linear
regression.
And the question is, what happens when you put that constraint?
What happens to the old solution, which is w_lin?
Given w_reg, which one generalizes better?
What is the form for each, et cetera?
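Written out, the constrained problem being posed here is

    \min_{w}\ \frac{1}{N}\,(Zw - y)^{\mathsf T}(Zw - y) \quad \text{subject to} \quad w^{\mathsf T}w \le C,

and its solution is the one we are calling w_reg.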
Let's see what we do to solve for w_reg.
You are minimizing this, subject to the constraint.
I can do this mathematically very easily using Lagrange multipliers, or
the inequality version of Lagrange multipliers, KKT, which I will
actually use in the derivation of support vector machines next week.
But here, I am just going to settle for a pictorial proof of what the
solution is, in order to motivate it.
And obviously, after you learn the KKT, you can go back and verify that
this is indeed the solution you get analytically.
So let's look at this.
I have two things here.
I have the error surface that I'm trying to minimize, and I have the
constraint.
So let's plot both of them in two dimensions, because
that's what we can plot.
Here is the way I'm drawing the in-sample error.
I am putting contours where the in-sample error is a constant.
So inside it will be a smaller E_in, smaller E_in,
and outside it will be bigger E_in, et cetera.
But on all points on that contour, which actually happens to be the
surface of an ellipsoid, if you solve it analytically, the
E_in is the same value.
When you look at the constraint, the constraint tells you to
be inside this circle.
So let's look at the centers for these guys.
What is the center here?
Well, the center here is the point where you get the minimum possible
in-sample error without a constraint.
And that we already declared to be w_lin, the solution for linear
regression.
So that is where you achieve the minimum possible E_in.
And as you go further and further, the E_in increases.
Now here's the constraint.
What is the center of the constraint?
Well, the center of the constraint is the origin, just because of
the nature of it.
Now the idea here is that you want to pick a point within this disc, such
that it minimizes that.
It shouldn't be a surprise to you that I will need to go as far out as I can
afford to, without violating the constraint, because this gets me
closer to that.
So the visual impression here is actually true mathematically.
Indeed, the constraint that you will actually end up working with is not
that w transposed w is less than or equal to C, but that it actually equals C.
That is where the best value of E_in, given the constraint, will occur.
It will occur at the boundary.
Let's look at a possible point that satisfies this, and try to find
an analytic condition for the solution.
Before we do that, let's say that the constraint was big enough to include
the solution for linear regression, that is, C is big enough that this is
the big circle.
What is the solution?
You already know it.
It's w_lin.
Because that is the absolute minimum, and it happens to be allowed by the
constraint.
So this is the solution.
The only case where you are interested in doing something new, is when the
constraint takes you away from that.
And now you have to find a compromise between the objective and the
constraint.
The compromise is such that you have to obey the constraint.
There is no compromise there.
But given that this is the condition, what would be the best you can get, in
terms of the in-sample error?
Let's take a point on the surface.
This is a candidate.
I don't know whether this gives you the minimum.
I don't think it will give me, because I already said that it should be as
close as possible to the outside.
But let's see.
Maybe this will give us the condition.
Let's look at this point, and look at the gradient of
the objective function.
The gradient of the objective function will give me a good idea about
directions to move in order to minimize E_in, as we have done before.
So if you draw this, you'll find that the gradient has to be orthogonal to
the ellipse, because the ellipse, by definition, has the
same value of E_in.
So the value of E_in does not change as you move along this.
Any change will have to come from moving orthogonally to it.
So the direction of the gradient will be this way.
And I'm putting it outside, because E_in grows as you move away from w_lin.
So that's one vector.
Now let's look at the orthogonal vector to the other
surface, the red surface.
That's not a gradient of anything yet.
But if we draw it, it looks like that.
And then I find out that this is, what?
This is just w.
If I take a point here, this is the origin.
This is the vector.
It happens to be orthogonal.
So this is the direction of the vector w.
This is the direction of the vector, the gradient of E_in.
Now by looking at this, I can immediately tell you that w does not
achieve the minimum of this function, subject to this constraint.
How do I know that?
Because I look at these, and there is an angle between them. If I move
in this direction, E_in will increase.
If I move in this direction, E_in will decrease.
I wouldn't be having that situation, if they were exactly the
opposite of each other.
Then I would be moving, and nothing will happen.
But now the gradient of E_in has a component along the tangent here.
And therefore, moving along this circle will change the value of E_in.
And if I increase it and decrease it by moving, then definitely this does
not achieve the minimum of E_in.
So I keep going until I get the point where I achieve the minimum of E_in.
And at that point, what would be the analytic condition?
The analytic condition is that this guy is going in one direction, this
guy is going in exactly the opposite direction.
So let's write the condition.
The condition is that the gradient, which is the blue guy, is proportional
to the negative of w of your solution.
Because now we have declared it the solution.
This is the value at which you achieve the optimum, under the constraint.
We've already called that w_reg.
So at the value of w_reg, the gradient should be proportional to
the negative of that.
Now because it's proportional to the negative of it, I'm going to put the
constant of proportionality in a very convenient way for further derivation.
I'm going to write it as minus,
twice--
I'm going to differentiate a squared somewhere, and I don't want the 2 to
hang around, so I'm putting it already--
lambda, that is my generic parameter.
And I'll divide it by N. Of course, I'm allowed to do that, because there
is some lambda that makes it right, so I'm just
putting it in that form.
When I put it in this form, I can now go-- this is the
condition for w_reg.
This equals minus that.
I can move things to the other side.
And now I have an equation which is very interesting.
I have this plus that, equals the vector 0.
Now this looks suspiciously close to being the gradient of something.
And if it happens to be the minimum of a function, then I can say, the
gradient is 0, so that corresponds to the minimum of whatever that is.
So let's look at what this is the differentiation of.
It's as if I was minimizing.
E_in gives me this fellow.
And conveniently, this fellow gives me this fellow, when I differentiate.
So the solution here is the minimization of this guy.
That's actually pretty cool.
Because I started with a constrained optimization problem, which is fairly
difficult to do in general.
You need some method to do that.
And by doing this logic, I ended up with minimizing something,
unconditionally.
Just minimize this, and whatever you find will be your solution.
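Putting the pictorial argument in symbols: at the constrained minimum the gradient of E_in points exactly opposite to the solution vector, and writing the proportionality constant as 2 lambda over N gives

    \nabla E_{\rm in}(w_{\rm reg}) \;=\; -\,\frac{2\lambda}{N}\, w_{\rm reg}
    \qquad\Longleftrightarrow\qquad
    \nabla\!\left( E_{\rm in}(w) + \frac{\lambda}{N}\, w^{\mathsf T}w \right)\bigg|_{w = w_{\rm reg}} = 0,

so w_reg is the unconstrained minimizer of E_in(w) + (lambda/N) w^T w, for some lambda that corresponds to the budget C.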
And here we have a parameter lambda, and here we have a parameter C. They
are related to each other.
And actually, parameter lambda depends on C, depends on the data set, depends
on a bunch of stuff.
So I'm not going to even attempt to get lambda analytically.
I just know that there is a lambda.
Because when we are done, you'll realize that the lambda we get for
regularization is decided by validation, not by solving anything.
So we don't have to worry about it yet.
But it's a good idea to think of what is C, related to lambda, just to be
able to relate to that translation of the problem, from the constrained
version to the unconstrained version.
The idea is that C goes up, lambda goes down, and vice versa.
So let's start with the formula.
What happens if C is huge?
Well, if C is huge, then w_lin is already the solution.
And therefore, you should be just minimizing E_in, as if there was
nothing, no constraint.
But that does correspond to lambda equals 0, doesn't it?
You will be minimizing E_in.
So if C is huge, lambda is 0.
Now let's get C smaller, and smaller.
When C is smaller, the regularization is more severe, because the condition
now is becoming more severe.
And in order to make the condition here more severe, in terms of the
regularization term, you need to increase lambda.
The bigger lambda is, the more emphasis you have to put on the
regularization part of the game.
And therefore, indeed, if C goes down, lambda goes up.
To the level where, let's say that C is 0. What is the solution here?
Well, you just left me one point in the domain.
I don't care what E_in is.
It happens to be the minimum, because it's the only value.
So the solution is whatever the value is, so w equals 0 is the solution.
How do you force this to have the solution w equals 0?
By getting lambda to be infinite, in which case you don't care about the
first term.
You just absolutely positively have to make w equal to 0.
So indeed, that correspondence holds.
So we put it there.
And we understand in our mind, there are two parameters that are
related to each other.
Analytically, we didn't find them.
But now we have a correspondence.
And the form we have here will serve as our form.
We have to be able to get lambda in a principled way, which we will.
This is the only remaining outstanding item of business.
Now let's look at augmented error, which is an interesting notion.
If you are minimizing E augmented, what is E augmented?
We used to minimize E_in.
Now we augmented it with another term, which is a regularization term.
So we write it down this way.
And for this particular case, it can simply be written out,
because E_in is no mystery.
We have a formula for it.
You look at this, and now this looks very promising.
If I asked you to solve this: it used to be a quadratic form, and it is
still a quadratic form.
So I don't think the solution would be difficult at all.
But the good news is that solving this, which is an unconditional,
unconstrained optimization, is equivalent to solving the
following problem.
You minimize E_in by itself, which we have the formula for, subject to the
constraint.
It's an important correspondence, because of the following.
The bottom formulation of the problem lends itself to the VC analysis.
I am restricting my hypothesis set, explicitly.
There are certain hypotheses that are no longer allowed.
I am using a subset of the hypothesis set.
I expect good generalization.
Mathematically, this is equivalent to the top one.
If you look at the top one, I am using the full hypothesis set, without
explicitly forbidding any value.
I am just using a different learning algorithm to find the solution.
Here, in principle, I minimize this.
Whatever the solution happens to be, that's what I get.
And I'm going to get a full-fledged w that happens to be a member of H_Q,
my hypothesis set.
Nothing here is forbidden.
Certain things are more likely than others, but that's
an algorithmic question.
So it will be very difficult to invoke a VC analysis here, but it's easy to
invoke it here.
And that correspondence between a constrained version, which is
the pure form of regularization as stated, and an augmented error,
which doesn't put a constraint, but adds a term that captures the
constraint in a soft form, that correspondence is the justification of
regularization in terms of generalization, as far as VC analysis
is concerned.
And it's true for any regularizer you use.
We are just giving here an example for this particular type of regularizer.
Now, let's get the solution.
That's the easy part.
We minimize this.
Not subject to anything.
And this is the formula for it.
What do you do?
You get the gradient of it equated to 0.
Can anybody differentiate this, as we have done before?
That results in--
This is the solution.
So this is the part we got from the first part, as we got in linear
regression.
That's what got us the pseudo-inverse solution in the first place.
And the other term conveniently gets lambda.
You can see why I chose the parameter to be in that funny form.
The 2 was there because of differentiating a square.
The over N was because this one has a 1 over N, so I was able to factor 1/N
out and leave a clean lambda here.
That's why I chose the constant of proportionality in that particular
functional form.
So I get this, and solve it.
And when you solve it, you get w_reg.
That's the formal name of the solution to this problem.
And that happens to be-- it's not the pseudo-inverse, but it's not that
far from it.
All you do is,
you just group the w guys, and then get the y on the other side, and do
an inverse, and that's what you get.
So this would be the solution, with regularization.
And as a reminder to us, if we didn't have regularization, and we were
solving for w_lin, w_lin would be simply this fellow, the regular
pseudo-inverse, which you can also get by simply setting lambda to 0 here.
Let's look at the solution. We had this without regularization.
And let's put this, because this is the one we're going to use with
regularization.
Now this is remarkable: in this case, under these assumptions, under this clean
setup, we actually have one-step learning, including regularization.
You just give me the data and lambda, and I have the solution outright.
So instead of doing a constrained optimization, or doing it in increments,
this is the solution.
That's a pretty good tool to have.
It also is very intuitive.
Because look at this.
If lambda is 0, you have the unconstrained, and you have without
regularization.
As you increase lambda, what happens?
The regularization term becomes dominant in the solution.
This is the guy that carries the information about the inputs.
This guy is just lambda times the identity, I.
Now take it to the extreme.
Let's say lambda is enormous.
Well, if lambda is enormous, this completely dominates that.
And the result would be about lambda times I. The other
guy is just noise.
And when I invert it, I will get something like 1 over lambda.
So w_reg would be 1 over lambda, for lambda huge, times something.
Who cares about the something?
1 over lambda is tiny, since lambda is huge.
It will knock it down to 0.
I am going to get w_reg that is very close to 0.
And indeed, I am getting a smaller and smaller w_reg solution as lambda gets
larger, which is what I expect.
And in the extreme case, I am going to be forced to have w equal to 0, which
is the case we saw before as the extreme case.
So this, indeed, agrees with the logic of what we expect.
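As a sketch in code, that one-step solution is just a few lines (the function name and the use of numpy are my choices, not the lecture's):

    import numpy as np

    def fit_regularized(Z, y, lam):
        # One-step solution with weight decay:
        #     w_reg = (Z^T Z + lambda I)^(-1) Z^T y
        # Setting lam = 0 recovers w_lin, the ordinary pseudo-inverse solution.
        d = Z.shape[1]
        return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)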
We have the solution.
Let's apply it, and see the result in a real case.
So we're now minimizing this, but we know what the solution is explicitly.
And what I am going to do, I'm going to vary lambda, because this will be
a very important parameter for us.
So we have the same regularizer, w transposed w, and I'm going to vary the
amount of regularization I put.
And I'm going to apply it to a familiar problem.
This is for different lambdas.
Remember this problem?
Yeah, we saw it last time.
Actually, we saw it earlier this lecture.
So this is the case.
Now, we are going to put it in the new terminology.
What is this?
This is the unconstrained fit.
In the new terminology, it is regularized, but with lambda equals 0.
Now let's put a little bit of regularization.
And here's what I mean by a little bit.
Is this a little bit for you?
Let's see the result.
Wow.
This is the guy I showed you last time, just as an appetizer.
Remember?
That's what it took.
So the medicine is working.
A small dose of the medicine did the job.
That's good.
Let's get carried away, like people get carried away with medicine, and
get a bigger dose.
What happens?
Oops.
I think we are overdosing here.
Let's do it further.
You can see what's happening.
I'm constraining the weights.
And now the algorithm, all it's doing is just constraining the weights, and
it doesn't care as much about the fit.
So the fit keeps getting flatter and more horizontal, until there is
absolutely nothing left in it.
If you keep increasing lambda, this curve that used to fit exactly
has its curvature getting smaller and its slope really mitigated,
and it keeps flattening out, et cetera.
And eventually what will happen?
This will be just a silly horizontal line.
You have just taken a fatal dose of the medicine!
That's what happened.
When you deal with lambda, you really need to understand that the choice of
lambda is extremely critical.
And the good news is the following.
Our choice of the type of regularizer, like the w transposed w in this case,
will be largely heuristic.
Studying the problem, trying to understand how to pick a regularizer,
this will be a heuristic choice.
But the choice of lambda will be extremely principled, based on validation.
And that will be the saving grace, if our heuristic choice for the
regularizer is not that great, as we will see in a moment.
If you want to characterize what's happening as you increase lambda, here
we started with overfitting.
That was the problem we were trying to solve.
And we solved it.
And we solved it.
And we solved it all too well.
We are certainly not overfitting here.
But the problem is that we went to the other extreme.
And now we are underfitting, just as bad.
So the proper choice of lambda is important.
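As a usage sketch, reusing legendre_features and fit_regularized from the earlier code, here is how a sweep over lambda might look. The target, the noise level, the sample size, and the particular lambda values are assumptions chosen to mimic the figures; typically E_in rises monotonically with lambda while E_out first improves and then deteriorates.

    import numpy as np

    rng = np.random.default_rng(1)
    f = lambda x: np.sin(np.pi * x)                     # assumed target for illustration
    x_train = rng.uniform(-1, 1, 15)
    y_train = f(x_train) + 0.2 * rng.normal(size=15)    # noisy training labels
    x_test = np.linspace(-1, 1, 500)

    Q = 10
    Z_train = legendre_features(x_train, Q)
    Z_test = legendre_features(x_test, Q)

    for lam in [0.0, 0.0001, 0.01, 1.0]:
        w = fit_regularized(Z_train, y_train, lam)
        e_in = np.mean((Z_train @ w - y_train) ** 2)
        e_out = np.mean((Z_test @ w - f(x_test)) ** 2)
        print(f"lambda = {lam:g}:  E_in = {e_in:.3f}   E_out = {e_out:.3f}")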
Now, the regularizer that I described to you is the most famous regularizer
in machine learning.
And it's called weight decay.
And the name is not very strange, because we're trying to get the
weights to be small, so decay is not a far-fetched term.
But I would like to explain why it is actually called,
specifically, decay.
The reason is the following.
Let's say that you are not in a neat, linear case like that.
Let's say you are doing this in a neural network.
And in neural networks, weight decay-- trying to minimize w transposed w,
is a very important regularization method.
We know that in neural networks, you don't have a neat closed-form
solution, and you use gradient descent.
So let's say you use gradient descent on this.
And let's say just batch gradient descent, for the simplicity of the
derivation.
What do you do?
In batch gradient descent, you have a step that takes you from w at time t
to w at time t plus 1.
And that happens to be this, minus eta, which is the learning
rate, times the gradient.
So we just need to put the gradient, and we have our step.
Right?
The gradient is the following.
The gradient is the gradient of the sum of this.
The gradient of the first part is what we had before.
If we didn't have regularization, that's what we would be doing.
And that is what happens.
And we got backpropagation.
But now there is an added term, because of this.
And that added term looks like that, just by differentiating.
So now, if I reorganize this by taking the terms that correspond to w(t) by
themselves, I am going to get this term, basically collecting these two
fellows, this guy and this guy, which happen to be multiplied by w(t).
And then I have this remaining guy, which I can put this way.
Now look at the interpretation of the step here.
I am in the weight space, and this is my weight.
And here is the direction that backpropagation is
suggesting that I move to.
It used to be, without regularization, that I'm moving from here to here.
Right?
Now using this thing, before I do that, which I'm going to do, I am
actually going to shrink the weights.
Here's the origin.
I am here.
I'm going to move in this direction.
Because this fellow is a fraction less than 1.
And it could be quite a small fraction, depending on lambda.
I could be shrinking by a factor of a half, or something.
Most likely, I'll shrink by very little, a factor like 0.999.
But in every step now, instead of just moving from this according to the
solution, I am shrinking then moving, shrinking then moving,
shrinking then moving.
So these guys are informative.
They tell me about what to do, in order to approximate the function.
This guy is just obediently
trying to go to 0.
That makes you unable to really escape very much.
If I was just going this way, this way, that way, et cetera, I would be
going very far.
But here now, every time, there is something that grounds you.
And if you take lambda to be big enough, that that fraction is huge,
then your learning would be, here, and this is a suggested direction.
I'm going to do it.
But before I do it, I'm going to go here.
Then you move this way, and the next time you go here.
And before you know it, you are at the origin, regardless of what the other
guy is suggesting.
And that is indeed what happens when lambda is huge.
You are so tempted towards the 0 solution that you don't really care
about learning the function itself.
The other factor pushes you over there.
So that's why it's called weight decay, because the weight decays
from one iteration to the next.
And it applies to neural networks.
All you need to remember is that in neural networks, the w transposed w is a pretty
elaborate sum.
You have to sum over all of the layers, all the input units, all the
output units.
And you sum the value of that weight squared.
So that's what you have.
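In symbols, the batch gradient descent step with weight decay described above is

    w(t+1) \;=\; w(t) - \eta\,\nabla\!\left( E_{\rm in}(w(t)) + \frac{\lambda}{N}\, w(t)^{\mathsf T} w(t) \right)
    \;=\; \left(1 - \frac{2\eta\lambda}{N}\right) w(t) \;-\; \eta\,\nabla E_{\rm in}(w(t)),

and for a neural network the regularizer is the sum over all layers l and all connections (i, j) of the squared weights,

    \Omega \;=\; \sum_{l}\sum_{i}\sum_{j} \left( w^{(l)}_{ij} \right)^{2}.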
Now let's look at variations of weight decay.
This is the method we developed.
And we would like to move to other regularizers, and try to infer some
intuition about the type of regularizer we pick.
So what do we do here?
Instead of just uniformly giving a budget C, and having the sum
of the w's squared be less than or equal to C, you can decide that some
weights are more important than others.
And the way you do it is by having this as your regularizer.
You introduce an importance factor.
Let's call it gamma.
And by the choice of the proper gamma--
these are constants that specify what type of regularizer
you are working with--
if this is to stay less than or equal to C, now you have some play.
If gamma is very small, I have more liberty of making that weight big,
because it doesn't take much from the budget.
If gamma is big, then I'd better be careful with the corresponding weight,
because it kills the budget.
Let's look at two extremes.
Let's say that I take the gamma to be positive exponential.
How do you articulate what the regularizer is doing?
Well, the regularizer is giving huge emphasis on higher-order terms.
So what is it trying to do?
It is trying to find, as much as possible, a low-order fit.
Let's say Q equals 10.
If it tries to use a 10th-order polynomial, even a small weight on the
10th-order term will already kill the budget.
Let's look at the opposite, where gamma decays with the order.
Now the costly guys are the early, low-order guys.
I'm OK with the high-order guys, but not the other guys.
So this would encourage a high-order fit.
You can get quite a variety this way.
And in fact, this functional form is indeed used in neural networks, not
for high-order or low-order, but for something else.
It is used because, when you do the analysis properly for neural networks,
you find that the best way to do weight decay is to give different
emphasis to the weights in different layers.
They play a different role in affecting the output.
And therefore, this would be accommodated by just having the proper
gamma in this case.
The most general form of this type of regularizer is the famous Tikhonov
regularizer, which is a very well-studied family of regularizers that
has this general form.
The previous one is a quadratic form, but a diagonal quadratic form.
I only take w_0 squared, w_1 squared, up to w_Q squared.
This one, when I put it in matrix form, is a general quadratic form.
It has the diagonal guys, and it has off-diagonal guys.
So it can also give weight to cross terms like w_1 times w_3, et cetera.
And by the proper choice of the matrix Gamma in this case, you'll get weight decay,
you'll get the low-order and high-order versions, and many others that fit within that form.
Therefore, studying this form is very interesting, because you cover
a lot of territory using it.
So these are some variations.
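In symbols, the variations just mentioned are the weighted budget and the general Tikhonov form,

    \sum_{q=0}^{Q} \gamma_q\, w_q^{2} \;\le\; C, \qquad \Omega(w) \;=\; w^{\mathsf T}\,\Gamma^{\mathsf T}\Gamma\, w,

where, for example, an exponentially growing gamma_q pushes toward a low-order fit and an exponentially decaying one toward a high-order fit (the specific exponential choice is just an example).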
Now let's even go more extreme, and go for not weight
decay, but weight growth.
Why not?
The game was what?
The game was constraining, right?
You don't want to allow all values of the weights.
Before, you didn't allow big weights.
Now I'm going to not allow small weights.
What's wrong with that?
It's a constraint.
OK, it's a constraint,
let's see how it behaves.
First, let's look at weight decay.
I'm plotting the performance of weight decay, the expected out-of-sample
error, as a function of the regularization parameter lambda.
There is an optimal value for the parameter, like we saw in the example,
that gives me the smallest one.
And before that, I am overfitting, and after that, I am starting to underfit.
And there's a value.
Any time you see the curve going down and then going up, it means that
that regularizer works, if you choose lambda right.
Because if I choose lambda here, I am going to get better out-of-sample
performance than if I didn't use regularization at all, which is
lambda equals 0.
Now we are going to plot the curve for the case where we constrain the
weights to be large.
So your penalty is for small weights, not for large weights.
What would the curve look like here?
If it went down from lambda equals 0 before going back up, that would be fine.
But it looks like this.
It's just bad.
But it's not fatal, because what?
Because our principled way of getting lambda got us lambda equals 0 as the
proper choice.
So we killed the regularizer altogether.
But it's a curious case.
Because now we are using regularizers, and it seems like you can even use
a regularizer that harms you.
So you might wonder whether you should use a regularizer at all.
You have to use a regularizer, because without a regularizer, you are going
to get overfitting.
There is no question about it.
It's a necessary evil.
But there are guidelines for choosing the regularizer that I'm going
to talk about now.
And after you choose the regularizer, there is the check of the lambda.
If you happen to choose the wrong one, and you do the validation correctly, the
validation will recommend that you give it weight 0, that is, lambda equals 0.
So there is no downside, except for the price you pay for validation.
So what is the lesson here?
It's a practical rule.
I am not going to make a mathematical statement here.
What is the criterion that we learned from weight decay that will guide us
in the choice of a regularizer?
Here is the observation that leads to the practical rule.
Stochastic noise, which we are trying to avoid fitting,
happens to be high-frequency.
That is, when you think of noise, it's like that.
Whereas the usual target functions are this way.
So this guy is this way.
How about the other type of noise, which is also a culprit for
overfitting?
Well, it's not as high-frequency.
But it is also non-smooth.
That is, we capture what we can with the model.
And what we left out, chances are we really couldn't capture it,
because it goes up and down faster or more strongly than we can capture.
Again, I am saying this is a practical observation.
This happens in most of the hypothesis sets that I get to choose, and the
target functions that I get to encounter.
And because of this, here is the guideline for choosing a regularizer.
Make it tend to pick smoother hypotheses.
Why is that?
We said that regularization is a cure, and the cure has a side effect.
It's a cure for what?
For fitting the noise.
So you want to make sure that you are punishing the noise, more than you are
punishing the signal.
These are the organisms we are trying to fight.
If we harm them more than we harm the patient, we'll be OK.
We'll put up with the side effect, because we are killing the disease.
These guys happen to be high-frequency.
So if your regularizer prefers smooth guys, it will fail to fit these guys
more than it will fail to fit the signal.
That is the guideline.
And it turns out that in most of the ways you mathematically write a hypothesis
set, as a parameterized set, smaller weights correspond to
smoother hypotheses.
I could do it the other way around.
Instead of my hypothesis being a summation of w times a polynomial, I
can make a summation of 1/w times a polynomial.
These are my parameters, in which case big weights will
be better, smoother.
But that's not the way people write hypothesis sets.
In most of the parametrization you're going to see, small weights correspond
to smoother hypotheses.
That's why small weights, or weight decay, works very well in those cases.
Because it has a tendency towards smooth guys.
Now let's write the general form of regularization, and then talk about
choosing a regularizer.
We are going to call the regularizer, like the weight decay regularizer by
itself without the lambda, Omega.
And it happens to be Omega of h.
It used to be a function of w, the parameters that determine h.
So if I now leave out the parameters explicitly, and I'm talking about the
general hypothesis set, it depends on which hypothesis you pick.
The value of the regularizer is that.
And the regularizer will prefer the guys for which Omega of h
happens to be smaller in value.
So you define this function, and you have defined a regularizer.
What is the augmented error that we minimize?
In this case, the augmented error is, again, the augmented error of the
hypothesis.
It happens to be of the weight, if that is the way you parameterize your
hypotheses.
And we write it down as this.
This is the form we had.
You get E_in.
That, already, we have.
And then you have the lambda, the important parameter, the dose of the
regularizer, and the form of the regularizer itself, which we just
called Omega of h.
So this is what we minimize.
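In symbols, the augmented error being minimized is

    E_{\rm aug}(h) \;=\; E_{\rm in}(h) \;+\; \frac{\lambda}{N}\,\Omega(h).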
Does this ring a bell?
Does it look like something you have seen before?
Well, yeah, it does.
But I have no idea what the relation might possibly be.
I have seen this one before from the VC analysis.
But it was a completely different ball game.
We were talking about E_out, not E_aug.
We were not optimizing anything.
This was less than or equal to.
Less than or equal to is fine, because we said that the behavior is generally
proportional to the bound.
So that's fine.
This is E_in, so that's perfect.
This guy is Omega.
Oh, I'm sneaky.
I called this Omega, deliberately.
But this one was the penalty for model complexity.
And the model was the whole model.
This is not a single hypothesis.
You give me the hypothesis set, I came up with a number that tells you how
bad the generalization will be for that model.
But now let's look at the correspondence here.
This is a complexity, and this is a complexity.
Although the complexity here is for individual hypotheses.
That's why it's helpful for me to navigate the hypothesis set.
Whereas this was just sitting here as an estimate.
When I talk about Occam's razor, I will relate the complexity of
an individual object, to the complexity of the set of objects, which is a very
important notion in its own right.
But if you look at that correspondence, you realize that what
I'm really doing here, instead of using E_in, I am using E_aug as
an estimate for E_out, if I take it literally.
And the thing here is that E_aug, the augmented error, is better than E_in.
Better in what?
Better as a proxy to E_out.
You can think of the holy grail of machine learning, is to find
an in-sample estimate of the out-of-sample error.
If you get that, you're done.
Minimize it, and go home.
But there's always the slack, and there are bounds, and this and that.
And now our augmented error is our next attempt, from using the plain
vanilla in-sample error, adding something up that gets us closer to
the out-of-sample error.
Of course, the augmented error is better than E_in in approximating
E_out, because it's purple.
Purple is closer to red than blue!
OK, no.
That's not the reason.
But that is at least the reason for the slide.
So this is the idea in terms of the theory.
We found a better proxy for the out-of-sample.
Now, very quickly, let's see how we choose a regularizer.
I say very quickly, not because of anything, but because it's really
a heuristic exercise, and I want to emphasize a main point here.
What is the perfect regularizer?
Remember when we talked about the perfect hypothesis set?
This was the hypothesis set that has a singleton, that happens to be our
target function.
Dream on!
We don't know the target function.
We cannot construct something like that.
Well, the perfect regularizer is also one that restricts, but in the
direction of the target function.
I think we can say that we are going in circles here.
We don't know the target function.
Now if you know a property of the target function, that allows you to go
there, that is not regularization.
There's another technique which uses properties of the target function, in
order to improve the learning.
Explicitly, this property holds for the target function, and there is
a prescription for how to use it.
Regularization is an attempt to reduce overfitting.
So it is not matching the target.
It doesn't know the target.
All it does is apply generically a methodology that harms the overfitting
more than it harms the fitting.
It harms fitting the noise, more than it harms fitting the signal.
And that is our guideline.
Because of that, it's a heuristic.
So the guiding principle we found was, you move in the direction of smoother.
And the direction of smoother-- we need to find the logic in our mind.
We are moving in the direction of smoother, because the
noise is not smooth.
That is really the reason.
Because we tend to harm the noise more by doing that.
And smoother is fine when we have a surface like that.
In some learning problems, we don't have a surface to be smooth, so the
corresponding thing is: simpler.
I'll give you an example from something you have seen before, which
is the movie rating, our famous example that we keep going back to.
We had the error function for the movie rating, right?
We were trying to get the factors to multiply to a quantity that is very
close to the rating of this user, that has those factors, to this movie, that
has those factors.
That's what we did.
And the factors were our parameters.
So we were adjusting the parameters, in order to match the rating.
And now in the new terminology, you realize that this is very susceptible
to overfitting.
Because let's say I have a user, and I'm using 100 factors.
That's 100 parameters dedicated to that user.
If that user only rated 10 movies, then I'm trying to determine 100
parameters using 10 ratings.
That's bad news.
So clearly, regularization is called for.
A notion of simpler here is very interesting.
The default that you are trying to pull toward is that everything gives the
average rating.
In the absence of further information, consider that everything is just the
average rating over all movies or all users.
Or you can be more finicky about it, and say the average of the movies that
I have seen, and average of the ratings that I have done.
Maybe I'm an optimistic user or not.
But just an average.
So you don't consider this particular movie, or this particular user.
You just take an average.
If you pull your solution toward the average, now we are regularizing
toward the simpler solution.
And indeed, that is the type of regularization that was used in the
winning solution of the Netflix competition.
So this is another notion of simpler, in a case where smoother
doesn't lend itself.
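To make the pull-toward-the-average idea concrete, here is a rough sketch of the kind of objective involved; the variable names and the exact form are illustrative, not the actual Netflix-winning code:

    import numpy as np

    def rating_objective(ratings, mu, b_user, b_movie, U, V, lam):
        # ratings: list of (user, movie, r) triples; mu: global average rating
        # U[u], V[m]: factor vectors; b_user, b_movie: per-user / per-movie offsets
        err = 0.0
        for u, m, r in ratings:
            pred = mu + b_user[u] + b_movie[m] + U[u] @ V[m]
            err += (r - pred) ** 2
        # The penalty shrinks the offsets and factors toward 0, so with a large
        # lambda every prediction collapses to the average rating mu: "simpler".
        penalty = lam * (np.sum(b_user ** 2) + np.sum(b_movie ** 2)
                         + np.sum(U ** 2) + np.sum(V ** 2))
        return err + penalty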
What happens if you choose a bad omega?
Which happens.
It's a heuristic choice.
I am moving toward this.
I may choose a good one, or a bad one.
And in a real situation, you will be choosing the regularizer in
a heuristic way.
You can do all of the math in the world.
But whenever you do the math, remember that you are always making
an assumption.
And your math will be as good, or as bad, as your assumption is
valid, or not valid.
There is no escaping that.
You don't hide behind a great-looking derivation, when the
basis of it is shaky.
We don't worry too much, because we have the saving grace of lambda.
We are going to go to validation, so you had better be here for the next
lecture, where we are going to choose lambda.
And if we happen to be unlucky, that after applying the guidelines we end
up with something that is actually harmful, then the validation will
tell us it's harmful, and we'll factor the regularizer out of the game
altogether.
But trying to put a regularizer in the first place is inevitable.
If you don't do it, you will end up with overfitting in almost all the
practical machine learning problems that you will encounter.
Now let's look at neural network regularizers, in order to get more
intuition about them.
And it's actually pretty useful for the intuition.
Let's look at weight decay for the neural network.
The math is not as clean, because we don't have a closed-form solution.
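The regularizer still just adds a term to whatever gradient backpropagation gives you; as a minimal sketch, with names of my own:

    def weight_decay_step(w, grad_E_in, lam, N, eta):
        # grad_E_in is the gradient from backpropagation; the extra term is
        # the derivative of (lambda/N) * w'w, which is all that weight decay adds.
        grad_aug = grad_E_in + (2.0 * lam / N) * w
        return w - eta * grad_aug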
But there is a very interesting interpretation that relates the small
weights to simplicity, in this case.
Remember this guy?
This was the activation function of the neurons.
And they were soft threshold.
And we said that the soft threshold is somewhere between
linear and hard threshold.
What does it mean to be between?
It means that if the signal is very small, you are almost linear.
If the signal is very large, one way or the other, you are almost binary.
Right?
So let's say that you are using small weights versus big weights.
If you use very small weights, then you are always within the linear region
here, because the weights are the ones that determine the signal.
So every neuron now is basically computing a linear function.
I have this big network, layer upon layer upon layer upon layer.
And I'm taking it, because someone told me that multilayer perceptrons are
capable of implementing large things.
So if I put enough of them, I'll be doing great.
And then I look at the functionality that I'm implementing, if I force the
weights to be very, very small.
Well, this is linear.
But this is linear of linear.
Linear of linear of linear.
And when I am done, what am I doing?
I am implementing just a simple linear function in
a huge camouflage disguise.
All the weights are just interacting and adding up, and I end up with just
a linear function, a very simple one.
So very small weights, I'm implementing a very simple function.
As you increase the weights, you are getting into the more interesting
nonlinearity here.
And if you go all the way, you will end up with a logical dependency.
And with a logical dependency, as we did with the sum of products, you can
implement any functionality you want.
You're going from the most simple to the most complex, by
increasing the weights.
So indeed, we have a correspondence in this case, not just smoothness per se,
but actually the simplicity of the function you are implementing, in
terms of the size of the weights.
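You can check this claim numerically; here is a small sketch with made-up layer sizes, showing that with tiny weights a tanh network is essentially the composition of its linear maps:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)                   # an input vector
    W1 = 0.001 * rng.standard_normal((5, 5))     # tiny first-layer weights
    W2 = 0.001 * rng.standard_normal((1, 5))     # tiny second-layer weights

    out_network = np.tanh(W2 @ np.tanh(W1 @ x))  # the actual network output
    out_linear = W2 @ (W1 @ x)                   # the purely linear composition

    print(out_network, out_linear)               # nearly identical for small weights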
There is another regularizer for neural networks, which is called
weight elimination.
The idea is the following.
We said that the VC dimension of neural networks is the number of weights,
more or less.
So maybe it's a good idea to take the network, and just
kill some of the weights.
So although I have the full-fledged network, I am forcing some of the
weights to be 0, in which case the number of free parameters that I have
will be smaller.
I will have a smaller VC dimension, and I stand a better chance of
generalizing.
Maybe I won't overfit.
Now this is true.
And there is an implementation of it.
The argument I just gave is that fewer weights lead to a smaller VC
dimension, and there is a version of it that lends itself to regularization,
which is called soft weight elimination.
I'm not going to go and combinatorially say, should I kill this
weight or kill that weight?
You can see this is a nightmare, in terms of optimization.
I'm going to apply something on a continuous function that will result,
more or less, in emphasizing some of the weights and killing the others.
Here is the regularizer in this case: the sum over i of w_i squared, divided
by beta squared plus w_i squared.
The numerator looks awfully similar to the weight decay.
If that's all I had, and it wasn't sitting upstairs in anticipation of
something that will happen downstairs, this would be just weight decay,
because I would just be adding up the squared weights.
But the actual form is the whole ratio, with beta squared plus w_i squared
in the denominator.
So what does this do?
Well, for very small w, beta dominates.
We end up with something proportional to w squared.
So for very small weights, you are doing weight decay.
For very large weights, the w's dominate.
Therefore, this term is basically 1, or close to 1.
So there is nothing to be gained by changing the weights.
At least, not much to be gained by changing the weights.
In this case, big weights are left alone.
Small weights are pushed towards zero.
We end up, after doing the optimization, clustering the weights into two
groups: serious weights that happen to have value, and other weights that are
really being pushed toward 0, which you can consider to be eliminated,
although they are only soft-eliminated.
And that's the corresponding notion.
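As a sketch of that regularizer, assuming the form just described:

    import numpy as np

    def soft_weight_elimination(w, beta):
        # Omega(w) = sum_i w_i^2 / (beta^2 + w_i^2)
        # |w_i| << beta: term ~ (w_i/beta)^2, i.e. weight decay on small weights
        # |w_i| >> beta: term saturates near 1, so big weights are left alone
        w = np.asarray(w, dtype=float)
        return np.sum(w ** 2 / (beta ** 2 + w ** 2))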
Early stopping, which we alluded to last time, is a form of regularizer,
and it's an interesting one.
Remember this thing?
We were training on E_in, no augmentation, nothing, just the
in-sample error.
And we realized by looking at the out-of-sample error, using
a test set, that it's a good idea to stop before you get to the end.
This is a form of regularizer, but it's a funny regularizer.
It's through the optimizer.
So we are not changing the objective function.
You are just handing the objective function, which is the in-sample error,
to the optimizer and telling it, please minimize this.
By the way, could you please not do a great job?
Because if you do a great job, I'm in trouble.
So that's what you do.
It's a funny situation.
It's not funny for early stopping,
because the way we choose when to stop is principled.
We are going to use validation to determine that point.
But some people get carried away and realize, maybe we can always put
the regularizer in the optimizer, and just do a sloppy job of optimization,
thus regularizing the thing.
Oh, wait a minute.
Maybe local minima are a blessing in disguise.
They force us to stop short of the global minimum, and therefore that's
a great regularizer.
OK, guys.
Heuristic is heuristic.
But we are still scientists and engineers.
Separate the concerns.
Put what you consider to be the right thing to minimize in the proper
function, in this case the augmented error.
And after that, give it to the optimizer to go all the way in
optimizing.
The wishy-washy approach is just uncertain.
I have no idea how it will work.
But if we capture as much as we can in the objective function, and we know
that we really want to minimize it, then we have a principled way of doing
that, and we will get what we want.
The early stopping is done by validation.
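A minimal sketch of that recipe, with placeholder functions for the optimizer step and the validation error:

    def train_with_early_stopping(train_step, validation_error, w0, max_epochs):
        # Minimize E_in with whatever optimizer train_step wraps, but keep the
        # weights that did best on a held-out validation set.
        w, best_w = w0, w0
        best_val = validation_error(w0)
        for _ in range(max_epochs):
            w = train_step(w)             # one pass of ordinary E_in minimization
            val = validation_error(w)     # measured on validation data, not E_in
            if val < best_val:
                best_val, best_w = val, w
        return best_w                     # the "stop before the end" solution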
So the final slide is the optimal lambda, which is a good lead into the
next lecture.
What I am going to show you is the choice of the optimal lambda in the
big experiment that I did last time.
Last time we had overfitting in a situation that had
the colorful graphs.
And now I am applying regularization there, using weight decay.
And I'm just asking myself, what is lambda, given the
different levels of noise?
So we look here.
I am applying the regularization.
This is the lambda.
It's the same regularizer, and I am changing the emphasis on the
regularizer.
And I am getting the bottom line, the expected
out-of-sample error, as a result.
When there is no noise, guess what?
Regularization is not indicated.
You just put lambda equals 0, and you are fine.
There's no overfitting to begin with.
As you increase the level of noise, as you see here, first you need
regularization.
The minimum occurs at a point where lambda is not 0.
So that means you actually need regularization.
And the end result is worse anyway.
But the end result has to be worse, because there is noise.
The expected out-of-sample error will have to have that level of noise in
it, even if I fit the other thing perfectly.
And as I increase the noise, I need more regularization.
The best value of lambda is greater.
So this is very, very intuitive.
And if we can determine this value using validation, then we have a very
good thing.
Now instead of using this, which was horrible overfitting,
I am getting this.
And I'm getting the best possible, given those.
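In practice, the corresponding recipe is just a sweep over lambda, scored by validation; here is a sketch with placeholder functions:

    import numpy as np

    def choose_lambda(fit, validation_error, lambdas):
        # fit(lam) returns the weights minimizing the augmented error for that
        # lambda; validation_error(w) stands in for the out-of-sample error.
        errors = [validation_error(fit(lam)) for lam in lambdas]
        best = int(np.argmin(errors))
        return lambdas[best], errors[best]

    # e.g. choose_lambda(fit, validation_error, np.logspace(-4, 1, 20))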
Now this happens to be for stochastic noise.
Out of curiosity, what would the situation be if we were talking about
deterministic noise?
And when you plot deterministic noise, well, you could have fooled me.
I am not increasing the sigma squared.
I am increasing the complexity of this guy, the complexity of the target.
And therefore, I'm increasing the deterministic noise.
It's exactly the same behavior.
Again, if I have this, I don't need any regularization.
As I increase the deterministic noise, I need more regularization.
The lambda is bigger, and I end up with worse performance.
And if you look at these two, that should seal the correspondence in your
mind that, as far as overfitting and its cures are concerned, deterministic
noise behaves almost exactly as if it were unknown stochastic noise.
I will stop here, and will take questions after a short break.
Let's start the Q&A.
MODERATOR: The first question is, when the regularization parameter is
chosen, say lambda, if it's chosen according to the data does that mean
we are doing data snooping?
PROFESSOR: OK.
If we were using the same data for training as for choosing the
regularization parameter, that would be bad news.
It's snooping.
But it's so clear, that I wouldn't even call it snooping.
It's blatant, in this case.
The reality is that we determine this using validation, which is a very
controlled form of using the data.
And we will discuss the subject completely, from beginning to end, in
the next lecture.
So there will be a way to deal with that.
MODERATOR: Would there be a case where you use different types of
regularization in the same function?
PROFESSOR: Correct.
Sometimes you use a combination of regularizers, with two different
parameters, depending on the performance.
As I mentioned, it is an experimental activity, more than a completely
principled activity.
There are guidelines, and there are regularizers that
stood the test of time.
And you can look at the problem, and you realize that, I'd better use
these two regularizers, because they behave differently in different parts
of the space, or something of that sort, and then decide to have
a combination.
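As a sketch of what such a combination looks like, with the choice of the two regularizers here being just an example:

    import numpy as np

    def combined_augmented_error(e_in, w, lam1, lam2, N, beta=1.0):
        # Two regularizers with their own parameters: weight decay plus
        # soft weight elimination, each weighted separately.
        omega1 = np.sum(w ** 2)                          # weight decay
        omega2 = np.sum(w ** 2 / (beta ** 2 + w ** 2))   # soft weight elimination
        return e_in + (lam1 / N) * omega1 + (lam2 / N) * omega2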
MODERATOR: In the examples, you were using Legendre polynomials as the
orthogonal functions.
Was there any reason for these?
Or can you choose other functions?
PROFESSOR: They give me a level of generality, which is pretty
interesting.
And the solution is very simple.
So it's the analytic appeal of it that got me into this.
The typical situation in machine learning-- machine learning is
somewhere between theory and practice.
And it really has very strong grounding in both.
So the way to use theory is this: you cannot really model every situation
in a way that gives you the closed-form solution.
You are far from that.
What you do is you get an idealized situation, but a situation as general
as you can get it.
With polynomials, you can do a lot of things.
So because I can get the solution in this case, when I look at the form of
the solution, I may be able to read off some intuitive properties that I can
extrapolate, and apply as a leap of faith, to situations where my
assumptions don't hold.
In this case, after getting this, we had a specific form for weight decay.
And when we look at the performance, we realize that
smoothness is a good criterion.
And then we look for smoothness or simplicity, and we interpret that in
terms of, oh, smoothness is actually good because of the properties of
noise, and so on.
So there is a formal part, where we can develop it completely, and try to make
it as general as possible while mathematically tractable.
But then try to see if the lessons learned from the solution, that you got
analytically, can apply to a situation in a heuristic way, where you don't
have the full mathematical benefit because the assumptions don't hold.
MODERATOR: Could noise be an indicator of missing input?
PROFESSOR: Missing input is a big deal in machine learning.
Sometimes you are missing some attributes of the input.
And it can be treated in a number of ways.
One of them is as if it's noise.
But missing inputs are sufficiently well defined, that they are treated
with their own methodology, rather than being generic noise.
MODERATOR: How do you trade off choosing more features in your
transformation, with the regularization?
PROFESSOR: It's a good question.
The first question was a question that we addressed, even before we heard of
overfitting and regularization.
And it was a question of generalization.
What is the dimensionality that we can afford, given the
resources of the data?
What regularization adds to the equation is that, maybe you can afford
a little bit of a bigger dimension, provided that you do the proper
regularization.
So again, it's the question of instead of having discrete steps-- I'm going
from this hypothesis set, to this hypothesis set, to this hypothesis set.
Let me try to find a continuum, such that, by the validation or by other
methods, I'll be able to find a sweet spot where I get the best performance.
And the best performance could be lying between two of
the discrete steps.
In this case, I can say, I couldn't initially afford to go to the
bigger hypothesis set.
Because if I go for it, and I go unconstrained, the
generalization just kills me.
But now what I'm going to do, I'm going to go to it anyway, and apply
regularization.
So I go to it, and then I track back in continuous steps using
regularization.
And I will end up with a situation, maybe, that I can afford, that wasn't
accessible to me without regularization, because it didn't
belong to the discrete grid that I used to work in.
MODERATOR: When regularization is done, will it depend on the data set
that you use for training?
PROFESSOR: The regularization is a term added.
So there is no explicit dependency of the regularization on the data set.
The data set goes into the in-sample error.
The regularization goes into a property of the hypothesis.
That is fairly independent of the data.
Actually, in the examples we gave, it was independent of the inputs.
The dependency comes from the fact that the optimal parameter, lambda,
does depend on the training set.
But I said that we were not going to worry about that analytically.
Because when all is said and done, lambda will be determined by
validation.
So it will inherit any properties just because of that.
MODERATOR: I think that's it.
PROFESSOR: We'll see you next week.