An introduction to Q-Learning: Reinforcement Learning (Temporal difference)
The last piece of the puzzle: Temporal difference
Recall the statement from a previous section:
But this time, we will not calculate those value footprints. Instead, we will let the robot figure it out (more on this in a moment).
Temporal Difference is the component that helps the robot calculate the Q-values as the environment changes over time. Suppose our robot is currently in the marked state and wants to move to the upper state. Note that the robot already knows the Q-value of taking that action, i.e., moving to the upper state.

We know that the environment is stochastic in nature, so the reward the robot gets after moving to the upper state may differ from an earlier observation. How do we capture this change (read: difference)? We recalculate the new Q(s, a) with the same formula and subtract the previously known Q(s, a) from it:

TD(s, a) = r(s, a) + γ · max_a′ Q(s′, a′) − Q(s, a)
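To make this concrete, here is a minimal sketch of computing that difference for a single observed transition. The Q-table shape, the state and action indices, the reward, and the discount factor value are all illustrative assumptions, not values from this article.

```python
import numpy as np

# Illustrative setup: a small Q-table indexed by (state, action).
# The shape, indices, reward and gamma below are assumptions.
Q = np.zeros((9, 9))
gamma = 0.75           # discount factor (assumed value)

state, action = 4, 1   # the robot's current state and chosen action
next_state = 1         # the state it lands in after acting
reward = 0             # reward observed for this transition

# New estimate of Q(s, a) from the Bellman equation, minus the
# previously recorded Q(s, a): the temporal difference.
new_estimate = reward + gamma * np.max(Q[next_state])
td = new_estimate - Q[state, action]
```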

The equation we just derived gives the temporal difference in the Q-values, which helps capture the random changes the environment may impose. The new Q(s, a) is updated as follows:

Q_t(s, a) = Q_{t−1}(s, a) + α · TD_t(s, a)

where:
- α is the learning rate, which controls how quickly the robot adapts to the random changes imposed by the environment
- Q_t(s, a) is the current Q-value
- Q_{t−1}(s, a) is the previously recorded Q-value
If we replace TD_t(s, a) with its full-form expression, we get:

Q_t(s, a) = Q_{t−1}(s, a) + α · ( r(s, a) + γ · max_a′ Q(s′, a′) − Q_{t−1}(s, a) )
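Putting the pieces together, here is a minimal sketch of one tabular Q-Learning update under the same illustrative assumptions (a NumPy Q-table, made-up state/action indices and reward, and arbitrarily chosen α = 0.9 and γ = 0.75). It is a sketch of the idea, not this article's implementation.

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.9, gamma=0.75):
    """One tabular Q-Learning update:
    Q_t(s, a) = Q_{t-1}(s, a)
                + alpha * (r(s, a) + gamma * max_a' Q(s', a') - Q_{t-1}(s, a))
    """
    # Temporal difference between the freshly observed estimate and
    # the previously recorded Q-value.
    td = reward + gamma * np.max(Q[next_state]) - Q[state, action]
    # Move the stored Q-value a fraction alpha toward the new estimate.
    Q[state, action] += alpha * td
    return Q

# Illustrative usage: a 9-state, 9-action Q-table with made-up indices
# and a made-up reward for a single observed transition.
Q = np.zeros((9, 9))
Q = q_learning_update(Q, state=4, action=1, reward=0, next_state=1)
```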
We now have all the pieces of Q-Learning in place and can move on to the implementation. Feel free to review the problem statement we discussed at the very beginning.