An introduction to Q-Learning: Reinforcement Learning (Transitioning to Q-Learning)
Transitioning to Q-Learning
By now, we have the following equation, which gives us the value of moving to a particular state (from now on, we will refer to the rooms as states), taking the stochasticity of the environment into account:

V(s) = max_a [ R(s, a) + γ Σ_s′ P(s, a, s′) V(s′) ]
We have also learned, very briefly, about the idea of the living penalty, which associates each move of the robot with a reward.
Q-Learning introduces the idea of assessing the quality of an action taken to move to a state, rather than determining the possible value of the state (the value footprint) being moved to.
Earlier, we had:

V(s) = max_a [ R(s, a) + γ Σ_s′ P(s, a, s′) V(s′) ]
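To make this concrete, here is a minimal sketch (not from the article) of how this equation could be iterated in code for a tiny tabular environment. The arrays R and P, the discount factor gamma, and the state/action counts are all hypothetical placeholders, not the actual rewards of our room example.

```python
import numpy as np

# Hypothetical tabular setup: 4 states, 4 actions (placeholder numbers).
n_states, n_actions = 4, 4
gamma = 0.9  # discount factor

rng = np.random.default_rng(0)
R = rng.uniform(-1, 1, size=(n_states, n_actions))                # R(s, a): reward for action a in state s
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s, a, s'): transition probabilities

V = np.zeros(n_states)
for _ in range(100):
    # V(s) = max_a [ R(s, a) + gamma * sum_s' P(s, a, s') * V(s') ]
    V = np.max(R + gamma * P @ V, axis=1)

print(V)  # one value footprint per state
```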
Let's incorporate the idea of assessing the quality of actions for moving to a certain state s′.

The robot now has four different states to choose from, and along with that, there are four different actions for the current state it is in. So how do we calculate Q(s, a), i.e. the cumulative quality of the possible actions the robot might take? Let's break it down.
From this equation, if we discard the max() function, we get:

R(s, a) + γ Σ_s′ P(s, a, s′) V(s′)
Essentially, in the equation that produces V(s), we are considering all possible actions and all possible states (from the current state the robot is in) and then taking the maximum value caused by taking a certain action. The above equation produces a value footprint for just one possible action. In fact, we can think of it as the quality of that action:

Q(s, a) = R(s, a) + γ Σ_s′ P(s, a, s′) V(s′)
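As a small illustration, a hedged sketch of this "quality of one action" computation might look as follows; R, P, V, and gamma are again hypothetical placeholders rather than anything from our room example.

```python
import numpy as np

def action_quality(s, a, R, P, V, gamma=0.9):
    # Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * V(s')
    # i.e. the value footprint produced by one particular action, with no max().
    return R[s, a] + gamma * P[s, a] @ V

# Placeholder 4-state, 4-action example.
rng = np.random.default_rng(0)
R = rng.uniform(-1, 1, size=(4, 4))
P = rng.dirichlet(np.ones(4), size=(4, 4))
V = np.zeros(4)

print(action_quality(s=0, a=2, R=R, P=P, V=V))
```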
Now that we have an equation to quantify the quality of a particular action, we are going to make a little adjustment to it. We can now say that V(s) is the maximum of all the possible values of Q(s, a). Let's utilize this fact and replace V(s′) with a function of Q():

Q(s, a) = R(s, a) + γ Σ_s′ P(s, a, s′) max_a′ Q(s′, a′)
But why would we do that?
To ease our calculations. Now we have only one function, Q() (which is also at the core of the dynamic programming paradigm), to calculate, and R(s, a) is a quantified metric that produces the reward for moving to a certain state. The qualities of the actions are called Q-values, and from now on we will refer to the value footprints as Q-values.
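Putting it all together, a minimal sketch of iterating on that single Q() function could look like this; R, P, and gamma are hypothetical placeholders, not the actual room rewards:

```python
import numpy as np

n_states, n_actions, gamma = 4, 4, 0.9

rng = np.random.default_rng(0)
R = rng.uniform(-1, 1, size=(n_states, n_actions))                # R(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s, a, s')

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    # Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * max_a' Q(s', a')
    Q = R + gamma * P @ Q.max(axis=1)

print(Q)              # one Q-value per (state, action) pair
print(Q.max(axis=1))  # V(s) falls out as the best Q-value in each state
```

Notice that V(s) is never stored separately here: once Q is the quantity being iterated, the value footprint of a state is simply the best Q-value available in it.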
We now have the last piece of the puzzle remaining, i.e. temporal difference, before we jump to the implementation part. We will study that in the next section.