Tip:
Highlight text to annotate it
X
In this video, I wanna give
you more practical tips for getting gradient descent to work.
The ideas in this video will
center around the learning rate alpha.
Concretely, here's the gradient
descent update rule and what
I want to do in this video
is tell you about what
I think of as debugging and some
tips for making sure that
Gradient Descent is working correctly
and second, I want to tell you
how to choose the rates
out for, but this is how
I go about choosing it.
Here's something that I often do
to make sure gradient descent is working correctly.
The job of gradient descent is
to find a value of
theta for you that, you
know, hopefully minimizes the cost function j of theta.
What I often do is therefore
pluck the cost function j
of theta as gradient descent runs.
So, the x-axis here is
the number of iteration of gradient
descent and as gradient descent
runs, you'll hopefully get a
plot that maybe looks like this.
Notice that the x-axis is
a number of iterations previously
we were looking at plots of
J of theta where the
X-axis, where the horizontal axis,
was the parameter vector theta but this is not where this is.
Concretely, what this point
is is I'm going
to rank gradient descent for hundred iterations.
And whatever value I get
for theta after a hundred
of the rations and get,
you know, some value of theta
after a hundred iterations and I'm
going to evaluate the cost
function J of theta for
the value of theta I get
after a hundred iterations and this
vertical height is the
value of J of theta for
the value of theta I got
after a hundred other ratios of
gradient descent and this
point here, that corresponds
to the value of J of
theta for the theta
that I get after I've
run grade and descent for two hundred iterations.
So what this plot is showing,
is it's showing the value of
your cost function after iteration of gradient descent.
And, if gradient descent is
working properly, then J
of theta should decrease.
after every iteration.
And one useful thing
that this sort of plot can
tell you also is that
if you look at the specific figure
that I've drawn, it looks like
by the time you've gotten out
to three hundred iterations,
between three and four hundred
iterations, in this segment, it
looks like J of theta hasn't gone down much more.
So by the time you get
to four hundred iterations, it looks
like this curve has flattened out here.
And so, way out
here at four hundred iterations, it
looks like grade and descend has
more or less converged because your
cost function isn't going down much more.
So looking at this figure can
also help you judge
whether or not gradient descent has converged.
By the way, the number of
iterations that gradient descent takes
to converge for a physical
application can vary a lot.
So maybe for one application gradient
descent may converge after just
thirty iterations, for a
different application gradient descent
made the 3,000 iterations.
For another learning algorithm
it may take three million iterations.
It turns out to be
very difficult to tell in
advance how many iterations gradient
descent needs to converge, and
is usually by plotting this sort of plot.
Plotting the cause function as we increase the number of iterations.
It's usually by looking at these
plots that I tried to tell
if gradient descent has converged.
It is also possible to come
up with automatic convergence test; namely
to have an algorithm to try
to tell you if gradient descent
has converged and here's maybe
a pretty typical example of an
automatic convergence test and
so, you test the clear convergence
if your cause function jf theta
decreases by less than
some small value epsilon, some
small value ten to the
minus three in one iteration,
but I find that usually
choosing what this threshold is is pretty difficult.
So, in order to check
your gradient descent has converged, I
actually tend to look at
plots like like this
figure on the left rather than
rely on an automatic convergence test.
Looking at this sort of
figure can also tell you or
give you an advanced warning if maybe
gradient descent is not working correctly.
Concretely, if you plug
jf theta as a function
of number of iterations, then, if
you see a figure like this,
where J of theta is actually
increasing, then that gives
you a clear sign that gradient descent is not working.
And a figure like this
usually means that you should be using smaller learning rate alpha.
If J of theta is actually
increasing, the most common
cause for that is if
you're trying to minimize
the function that maybe looks like this.
That's if your learning rate is
too big then if you
start off there, gradient descent
may overshoot the minimum, send
you there, then if only there's
too big, you may overshoot again,
it will send you there and
so on so that what
you really wanted was really
start here and for to slowly go downhill.
But if the learning is too
big then gradient descent can
instead keep on over
shooting the minimum so
that you actually end up
getting worse and worse instead
of getting the higher values of
the cost function j of theta
so do you end up with a
plot like and if you
see a plot like this the
fix usually is to just
use a smaller value of alpha.
Oh, and also of course make
sure that your code does not have a bug in it.
But usually to watch it
out of the firms is the
most common, could be a common problem.
Similarly, sometimes, you may
also see j of theta
do something like this and it
go down for a while then
go up then go down for a while then go up.
Go down for a while, it
goes up and so on and
and to fix for something like
this is also to use a smaller value of alpha.
I'm not going to prove it
here, but undeniable assumptions about
the cost function, which does proof of linear regression.
You can show of mathematicians have
shown that if your learning
rate offer is small enough
then j of theta should decrease on every single iteration.
So, if this doesn't happen, probably
means algorithm is too big then
you should send a smaller, but of
course, you all So you don't
want your learning rate to be
too small because if you
do that, if you were
to do that, then gradient descent can be slow to converge.
And if alpha were too
small, you might end up
starting out here, say, and,
you know, end up taking just
minuscule, minuscule baby steps.
Right?
And just taking a lot
of iterations before you finally get to the minimum.
And so, if alpha is too
small, gradient descent can
make very slow progress and be slow to converge.
To summarize, if the learning
rate is too small, you can
have a slow convergence problem, and
if the learning rate is too
large, j of theta may
not decrease on every iteration
and may not even converge.
In some cases, if the learning
rate is too large, slow convergence
is also possible, but the
more common problem you see
is that just that j of
theta may not decrease on every iteration.
And in order to debug all
of these things, often plotting that
j of theta as a function
of the number of iterations can help you figure out what's going on.
Concretely, what I actually
do when I run gradient
descent is I would try a range of values.
So just try running gradient descent
with a range of values for
alpha, like 0.001, 0.01,
so these are a
factor of 10 differences, and
for these differences of this
of alpha, just plot j of
theta as a function of number
of iterations and then pick
the value of alpha that, you
know, seems to be causing j of theta to decrease rapidly.
In fact, what I do actually isn't these steps of ten.
So, you know, this is
a scale factor of ten if you reach the top.
What I'll actually do is try
this range of values and
so on where this is,
you know, 0.001
then increase the linear rate to
3.4 to get 0.03 and then
to step up this is another
roughly 3 fold increase point
of 0.03 to 0.01s and so these
are roughly, you know,
trying out gradient descents with each
value I try being about
3X bigger than the previous value.
So what I'll do is a range
of values until I've made sure
that I've found one value that
is too small and made sure
I found one value that is
too large, and then I sort
of try to pick the largest
possible value or just something
slightly smaller than the
largest reasonable value that I found.
And when I do that
usually it just gives me
a good learning rate for my problem.
And if you do this
too, hopefully you will be
able to choose a good
learning rate for your implementation
of gradient descent.