So let's move on to Active Reinforcement Learning
and, in particular, let's examine a simple
approach called a Greedy Reinforcement Learner.
And the way that works is it uses the same
passive TD learning algorithm that we talked about,
but after each update to the utilities--
or maybe after every few updates; you can decide how often you want to do it--
we recompute the new optimal policy, pi.
So we throw away our old policy, pi1,
and replace it with a new policy, pi2--
which is the result of solving the MDP described by our new estimates of the utilities.
Now we have a new policy,
and we continue learning with that new policy.
And so, if the initial policy was flawed,
the greedy algorithm tends to move away from that initial policy
toward a better policy--and we can show how well that works.
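As a concrete illustration, here is a minimal sketch of that loop on a hypothetical four-state chain world. Everything here--the world, the slip probability, the learning rate--is an assumption chosen for illustration; the transition model is taken as known, so only the utilities are learned by TD, and the policy is recomputed by greedy one-step lookahead after every episode.

```python
import random

random.seed(0)

# Hypothetical 4-state chain: state 3 is the goal (terminal, reward +1).
# Actions move one step left/right but slip the other way 20% of the time.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
ALPHA, GAMMA, SLIP = 0.5, 0.9, 0.2

def outcomes(s, a):
    """Known transition model: (probability, next_state) pairs for s, a."""
    move = {"left": max(s - 1, 0), "right": min(s + 1, 3)}
    other = "right" if a == "left" else "left"
    return [(1 - SLIP, move[a]), (SLIP, move[other])]

def step(s, a):
    """Sample a transition; reward +1 is received on entering the goal."""
    (p1, s2a), (_, s2b) = outcomes(s, a)
    s2 = s2a if random.random() < p1 else s2b
    return s2, (1.0 if s2 == 3 else 0.0)

U = {s: 0.0 for s in STATES}            # utility estimates
pi = {s: "left" for s in [0, 1, 2]}     # deliberately flawed initial policy

def recompute_policy():
    """Throw away the old pi: greedy one-step lookahead on current U."""
    for s in [0, 1, 2]:
        def value(a):
            return sum(p * ((1.0 if s2 == 3 else 0.0) + GAMMA * U[s2])
                       for p, s2 in outcomes(s, a))
        pi[s] = max(ACTIONS, key=value)

for episode in range(200):
    s = random.choice([0, 1, 2])
    while s != 3:                        # follow the current policy
        a = pi[s]
        s2, r = step(s, a)
        # Passive TD update of U for the state we just left
        U[s] += ALPHA * (r + GAMMA * U[s2] - U[s])
        s = s2
    recompute_policy()                   # adopt the new greedy policy
```

In this sketch the recomputed policy flips to "right" next to the goal almost immediately, even though the learner started out always moving left--the drift away from a flawed initial policy described above.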