This problem involves a Q-learning agent that is currently situated at the square (3,3)
and executes the NORTH action, trying to go up.
But because the environment is stochastic, it actually ends up arriving at the terminal state
with value 100.
And what I want you to answer is how the Q-values should be updated for this state.
I want you to enter the Q-values over here, because we don't want you to
mess up the original, and we'll use the formula below, which I should point out is
the SARSA version of Q-learning.
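For reference, that SARSA update rule is: Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)].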
In this formula, the parameter alpha, the learning rate, will take on the value 1/2, and
gamma, the discount rate, will be 0.9.
All the rewards for moving from one state to the next are 0,
with the exception of moving into the terminal state.
And this Q(s′, a′) term refers to what goes on in the next state:
here we were in state s, we took the action of going NORTH,
and we transitioned into this next state. In that state, no matter what action you take,
the Q-value is always 100, so this term will always be 100.
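As a minimal sketch, here is what that single backup looks like in Python. The starting Q-value of 0 for the pair ((3,3), NORTH) and the reward r = 0 on this transition are assumptions for illustration; the terminal state's 100 enters through the Q(s′, a′) term, as described above.

def sarsa_update(q_sa, reward, q_next, alpha=0.5, gamma=0.9):
    # One SARSA backup: Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
    return q_sa + alpha * (reward + gamma * q_next - q_sa)

q_old = 0.0          # assumed initial Q((3,3), NORTH); not stated in the lecture
q_terminal = 100.0   # Q(s', a') is 100 for every action in the terminal state
q_new = sarsa_update(q_old, reward=0.0, q_next=q_terminal)
print(q_new)         # 0 + 0.5 * (0 + 0.9 * 100 - 0) = 45.0

Under these assumptions, the backup gives 0 + 0.5 · (0 + 0.9 · 100 − 0) = 45, so the updated Q-value for that state-action pair would be 45.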