So the question, then, is: How do we get this learner out of its rut?
It improved its policy for a while,
but then it got stuck in this policy
where we go here, go up and then go right.
Most of the time, that's a perfectly good policy.
But if a stochastic error makes us slip into the minus 1, then it hurts us.
We'd like to be able to say we're going to stop doing that
and somehow find this route.
But in order to find that new route,
we'd have to spend some time executing a policy
which was not the best policy known to us.
In other words, we'd have to stop exploiting
the best policy we'd found so far--which is this one--
and start exploring, to see if maybe there's a better policy.
And exploring could lead us astray
and cause us to waste a lot of time.
So we have to figure out: what's the right trade-off?
When is it worth exploring to try to find something better for the long term--
even though we know that exploring is going to hurt us in the short term?
Now, one possibility is, certainly, random exploration.
That is, we can follow our best policy
some percentage of the time,
and then randomly, at some point,
we can decide to take an action which is not the optimal action.
So we're here, the optimal action would be to go east;
and we say, "Well, this time we're going to choose something else--
let's try going north."
And then we explore from there
and see if we've learned something.
So that policy does, in fact, work--
randomly making moves with some probability--but it tends to be slow to converge.
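The random-exploration scheme described here is commonly called epsilon-greedy action selection. A minimal sketch, assuming a small grid world with the four compass actions (the function and action names are illustrative, not from the lecture):

```python
import random

# Epsilon-greedy sketch: with probability epsilon we explore by picking a
# random action; otherwise we exploit the best action known so far.
ACTIONS = ["north", "south", "east", "west"]

def epsilon_greedy(best_action, epsilon=0.1, rng=random):
    """Return best_action most of the time; occasionally explore at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore: any action, uniformly at random
    return best_action              # exploit: follow the best-known policy

# Usage: suppose the best-known action from the current square is "east".
random.seed(0)
counts = {a: 0 for a in ACTIONS}
for _ in range(10000):
    counts[epsilon_greedy("east", epsilon=0.1)] += 1
# "east" dominates, but each other action still gets tried occasionally,
# which is what lets the learner eventually discover a better route.
```

Small epsilon means safe but slow discovery; large epsilon explores faster but wastes more time on bad moves, which is exactly the trade-off the lecture describes and why this approach converges slowly.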
In order to get something better, we have to really understand
what's going on with our exploration, versus exploitation.