An introduction to Q-Learning: Reinforcement Learning (Temporal difference)

The last piece of the puzzle: Temporal difference

Recall this statement from a previous section:
But this time, we will not calculate those value footprints. Instead, we will let the robot figure it out on its own (more on this in a moment).
Temporal Difference is the component that helps the robot calculate the Q-values as the environment changes over time. Consider that our robot is currently in the marked state and wants to move to the upper state. Note that the robot already knows the Q-value of that action, i.e. of moving to the upper state.
[Figure: An environment with an agent]
We know that the environment is stochastic in nature, so the reward the robot gets after moving to the upper state may differ from an earlier observation. How do we capture this change (read: difference)? We recalculate the new $Q(s, a)$ with the same formula and subtract the previously known $Q(s, a)$ from it:

$TD_t(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a)$
The equation we just derived gives the temporal difference in the Q-values, which helps capture the random changes the environment may impose. The new $Q(s, a)$ is then updated as follows (a small worked example appears after the definitions below):
$Q_t(s, a) = Q_{t-1}(s, a) + \alpha \, TD_t(s, a)$
where,
  • $\alpha$ is the learning rate, which controls how quickly the robot adapts to the random changes imposed by the environment
  • $Q_t(s, a)$ is the current Q-value
  • $Q_{t-1}(s, a)$ is the previously recorded Q-value
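To make the roles of these terms concrete, here is a small worked example with made-up numbers (purely for illustration, not taken from our environment). Suppose the previously recorded value is $Q_{t-1}(s, a) = 2$, the freshly recomputed target is $R(s, a) + \gamma \max_{a'} Q(s', a') = 3$, and the learning rate is $\alpha = 0.5$. The temporal difference is then $TD_t(s, a) = 3 - 2 = 1$, so the update gives

$Q_t(s, a) = 2 + 0.5 \times 1 = 2.5$

With $\alpha = 1$ the robot would jump straight to the new estimate of 3; with $\alpha = 0$ it would ignore the new observation entirely.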
If we replace $TD_t(s, a)$ with its full-form equation, we get:
$Q_t(s, a) = Q_{t-1}(s, a) + \alpha \left( R(s, a) + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a) \right)$
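To see the update rule in action, here is a minimal Python sketch. It is only an illustration under assumed names: the q_table dictionary, the state labels, and the reward values below are inventions for this demo, not the environment we will build in the implementation part.

```python
# Minimal Q-learning (temporal difference) update sketch.
# All states, actions, rewards, and hyperparameters here are made up for illustration.

ALPHA = 0.1   # learning rate: how quickly the robot adapts to changes
GAMMA = 0.9   # discount factor for future rewards

def td_update(q_table, state, action, reward, next_state, actions):
    """Apply one temporal-difference update:
    Q_t(s, a) = Q_{t-1}(s, a) + alpha * (R(s, a) + gamma * max_a' Q(s', a') - Q_{t-1}(s, a))
    """
    old_q = q_table[(state, action)]                              # Q_{t-1}(s, a)
    best_next = max(q_table[(next_state, a)] for a in actions)    # max_a' Q(s', a')
    temporal_difference = reward + GAMMA * best_next - old_q      # TD_t(s, a)
    q_table[(state, action)] = old_q + ALPHA * temporal_difference
    return q_table[(state, action)]

# Tiny usage example with two states and a single action, all values invented.
actions = ["up"]
q_table = {("s0", "up"): 2.0, ("s1", "up"): 3.0}
new_q = td_update(q_table, state="s0", action="up",
                  reward=1.0, next_state="s1", actions=actions)
print(round(new_q, 2))  # 2.17: the old value nudged slightly toward the new target
```

Because $\alpha$ is small here, each observation only nudges the stored Q-value toward the new target; this is how the robot gradually adapts to a stochastic environment instead of overwriting what it has already learned.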
We now have all the pieces of Q-Learning in place and can move on to its implementation. Feel free to review the problem statement we discussed at the very beginning.
