Tip:
Highlight text to annotate it
X
Here is my solution, I have the value function initialized. It has lots of 999s.
The policy is a similar function in 3D.
Then I have a function called policy2d, which is the one I'm later going to print.
That's the same in 2D.
Scrolling down, my update function is exactly the same as before for dynamic programming
While change exists go through [x, y]'s and all orientations
of which there are 4, so it's now a deeper loop.
If you found the goal location, then update the value,
and if there's an actual update, set "change" to True
and also mark it as the goal location.
Otherwise, if our grid cell is navigable at all,
let's go through the 3 different actions and here's a tricky part
how to make the action work but it works beautifully.
We go through the 3 different actions.
When we tag the ith action,
we add the corresponding orientation change to our orientation modulo 4.
It's a cyclic buffer, so this might subtract 1.
Keeping it the same will add 1 to orientation.
Then we apply the corresponding new motion model to x and y to obtain x2 and y2.
Then over here is our model of a car that steers first and then moves.
Scrolling down further, if we arrived at a valid grid cell in that it's still inside the grid
and it's not an obstacle, then like before we add to the value
the value of this new grid cell plus the cost of the corresponding action.
This is non-uniform, depending on what action we pick now.
This improves over the existing value.
We set this value to be the new value, and we mark change as True.
We also memorize the action name as before.
This is all effectively the same code as we had before
when we did dynamic programming in a 2-dimensional world.
It gets us the value function, and it gets us the policy action.
However, I printed out a 2-dimensional table, not a 3-dimensional table.
To get to the 2-dimensional table, I now need to be sensitive of my initial state.
Otherwise, it actually turns out to be undefined.
Let me set the initial state to be x, y, and orientation.
All I do now is run the policy.
With the very first state, I copy over the policy form the 3-dimensional table
into the 2-dimensional one, which will be this hash mark over here.
While I haven't reached the goal state quite yet as indicated
by checking for the star in my policy table.
Now, my policy table has a hash mark R and L,
but otherwise is the same as before.
If it's a hash mark, we just keep our orientation the way it is.
If it's R, I turn to the right. L is turn to the left.
I apply my forward motion,
and I then update my new x and y coordinates
to be the corresponding after the motion,
and I update my orientation to be o2.
Finally, I copy the 3-dimensional symbol for my policy straight into the 2-dimensional array.
This is the array that I finally print.
The key insight here is to go from the 3-dimensional full policy
to a 2-dimensional array I had to run the policy.
That's something you would have done to get back this table over here.
That's somewhat nontrivial. I didn't tell you this, but I hope you figured it out.
But everything else is the same dynamic programming loop that you've seen before.