4.4 - Gradient descent in practice 2 - Learning rate - [ machine learning ] by andrew ng

In this video, I wanna give you more practical tips for getting gradient descent to work. The ideas in this video will center around the learning rate alpha. Concretely, here's the gradient descent update rule and what I want to do in this video is tell you about what I think of as debugging and some tips for making sure that Gradient Descent is working correctly and second, I want to tell you how to choose the rates out for, but this is how I go about choosing it. Here's something that I often do to make sure gradient descent is working correctly. The job of gradient descent is to find a value of theta for you that, you know, hopefully minimizes the cost function j of theta. What I often do is therefore pluck the cost function j of theta as gradient descent runs. So, the x-axis here is the number of iteration of gradient descent and as gradient descent runs, you'll hopefully get a plot that maybe looks like this. Notice that the x-axis is a number of iterations previously we were looking at plots of J of theta where the X-axis, where the horizontal axis, was the parameter vector theta but this is not where this is. Concretely, what this point is is I'm going to rank gradient descent for hundred iterations. And whatever value I get for theta after a hundred of the rations and get, you know, some value of theta after a hundred iterations and I'm going to evaluate the cost function J of theta for the value of theta I get after a hundred iterations and this vertical height is the value of J of theta for the value of theta I got after a hundred other ratios of gradient descent and this point here, that corresponds to the value of J of theta for the theta that I get after I've run grade and descent for two hundred iterations. So what this plot is showing, is it's showing the value of your cost function after iteration of gradient descent. And, if gradient descent is working properly, then J of theta should decrease. after every iteration. And one useful thing that this sort of plot can tell you also is that if you look at the specific figure that I've drawn, it looks like by the time you've gotten out to three hundred iterations, between three and four hundred iterations, in this segment, it looks like J of theta hasn't gone down much more. So by the time you get to four hundred iterations, it looks like this curve has flattened out here. And so, way out here at four hundred iterations, it looks like grade and descend has more or less converged because your cost function isn't going down much more. So looking at this figure can also help you judge whether or not gradient descent has converged. By the way, the number of iterations that gradient descent takes to converge for a physical application can vary a lot. So maybe for one application gradient descent may converge after just thirty iterations, for a different application gradient descent made the 3,000 iterations. For another learning algorithm it may take three million iterations. It turns out to be very difficult to tell in advance how many iterations gradient descent needs to converge, and is usually by plotting this sort of plot. Plotting the cause function as we increase the number of iterations. It's usually by looking at these plots that I tried to tell if gradient descent has converged. It is also possible to come up with automatic convergence test; namely to have an algorithm to try to tell you if gradient descent has converged and here's maybe a pretty typical example of an automatic convergence test and so, you test the clear convergence if your cause function jf theta decreases by less than some small value epsilon, some small value ten to the minus three in one iteration, but I find that usually choosing what this threshold is is pretty difficult. So, in order to check your gradient descent has converged, I actually tend to look at plots like like this figure on the left rather than rely on an automatic convergence test. Looking at this sort of figure can also tell you or give you an advanced warning if maybe gradient descent is not working correctly. Concretely, if you plug jf theta as a function of number of iterations, then, if you see a figure like this, where J of theta is actually increasing, then that gives you a clear sign that gradient descent is not working. And a figure like this usually means that you should be using smaller learning rate alpha. If J of theta is actually increasing, the most common cause for that is if you're trying to minimize the function that maybe looks like this. That's if your learning rate is too big then if you start off there, gradient descent may overshoot the minimum, send you there, then if only there's too big, you may overshoot again, it will send you there and so on so that what you really wanted was really start here and for to slowly go downhill. But if the learning is too big then gradient descent can instead keep on over shooting the minimum so that you actually end up getting worse and worse instead of getting the higher values of the cost function j of theta so do you end up with a plot like and if you see a plot like this the fix usually is to just use a smaller value of alpha. Oh, and also of course make sure that your code does not have a bug in it. But usually to watch it out of the firms is the most common, could be a common problem. Similarly, sometimes, you may also see j of theta do something like this and it go down for a while then go up then go down for a while then go up. Go down for a while, it goes up and so on and and to fix for something like this is also to use a smaller value of alpha. I'm not going to prove it here, but undeniable assumptions about the cost function, which does proof of linear regression. You can show of mathematicians have shown that if your learning rate offer is small enough then j of theta should decrease on every single iteration. So, if this doesn't happen, probably means algorithm is too big then you should send a smaller, but of course, you all So you don't want your learning rate to be too small because if you do that, if you were to do that, then gradient descent can be slow to converge. And if alpha were too small, you might end up starting out here, say, and, you know, end up taking just minuscule, minuscule baby steps. Right? And just taking a lot of iterations before you finally get to the minimum. And so, if alpha is too small, gradient descent can make very slow progress and be slow to converge. To summarize, if the learning rate is too small, you can have a slow convergence problem, and if the learning rate is too large, j of theta may not decrease on every iteration and may not even converge. In some cases, if the learning rate is too large, slow convergence is also possible, but the more common problem you see is that just that j of theta may not decrease on every iteration. And in order to debug all of these things, often plotting that j of theta as a function of the number of iterations can help you figure out what's going on. Concretely, what I actually do when I run gradient descent is I would try a range of values. So just try running gradient descent with a range of values for alpha, like 0.001, 0.01, so these are a factor of 10 differences, and for these differences of this of alpha, just plot j of theta as a function of number of iterations and then pick the value of alpha that, you know, seems to be causing j of theta to decrease rapidly. In fact, what I do actually isn't these steps of ten. So, you know, this is a scale factor of ten if you reach the top. What I'll actually do is try this range of values and so on where this is, you know, 0.001 then increase the linear rate to 3.4 to get 0.03 and then to step up this is another roughly 3 fold increase point of 0.03 to 0.01s and so these are roughly, you know, trying out gradient descents with each value I try being about 3X bigger than the previous value. So what I'll do is a range of values until I've made sure that I've found one value that is too small and made sure I found one value that is too large, and then I sort of try to pick the largest possible value or just something slightly smaller than the largest reasonable value that I found. And when I do that usually it just gives me a good learning rate for my problem. And if you do this too, hopefully you will be able to choose a good learning rate for your implementation of gradient descent.