So let's move on to Active Reinforcement Learning
and, in particular, let's examine a simple
approach called a Greedy Reinforcement Learner.
And the way that works is it uses the same
passive TD learning algorithm that we talked about,
but after each update to the utilities--
or maybe after every few updates; you can decide how often you want to do it--
we recompute the new optimal policy, pi.
So we throw away our old policy, pi1,
and replace it with a new policy, pi2--
which is the result of solving the MDP described by our new estimates of the utilities.
Now we have a new policy,
and we continue learning with that new policy.
And so, if the initial policy was flawed,
the greedy algorithm tends to move away from that initial policy
toward a better policy--and we can show how well that works.
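As a concrete illustration, here is a minimal sketch of that loop on a hypothetical four-state chain world. Everything here--the world, the slip probability, the learning rate--is an assumption chosen for illustration; the transition model is taken as known, so only the utilities are learned by TD, and the policy is recomputed by greedy one-step lookahead after every episode.

```python
import random

random.seed(0)

# Hypothetical 4-state chain: state 3 is the goal (terminal, reward +1).
# Actions move one step left/right but slip the other way 20% of the time.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
ALPHA, GAMMA, SLIP = 0.5, 0.9, 0.2

def outcomes(s, a):
    """Known transition model: (probability, next_state) pairs for s, a."""
    move = {"left": max(s - 1, 0), "right": min(s + 1, 3)}
    other = "right" if a == "left" else "left"
    return [(1 - SLIP, move[a]), (SLIP, move[other])]

def step(s, a):
    """Sample a transition; reward +1 is received on entering the goal."""
    (p1, s2a), (_, s2b) = outcomes(s, a)
    s2 = s2a if random.random() < p1 else s2b
    return s2, (1.0 if s2 == 3 else 0.0)

U = {s: 0.0 for s in STATES}            # utility estimates
pi = {s: "left" for s in [0, 1, 2]}     # deliberately flawed initial policy

def recompute_policy():
    """Throw away the old pi: greedy one-step lookahead on current U."""
    for s in [0, 1, 2]:
        def value(a):
            return sum(p * ((1.0 if s2 == 3 else 0.0) + GAMMA * U[s2])
                       for p, s2 in outcomes(s, a))
        pi[s] = max(ACTIONS, key=value)

for episode in range(200):
    s = random.choice([0, 1, 2])
    while s != 3:                        # follow the current policy
        a = pi[s]
        s2, r = step(s, a)
        # Passive TD update of U for the state we just left
        U[s] += ALPHA * (r + GAMMA * U[s2] - U[s])
        s = s2
    recompute_policy()                   # adopt the new greedy policy
```

In this sketch the recomputed policy flips to "right" next to the goal almost immediately, even though the learner started out always moving left--the drift away from a flawed initial policy described above.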