An introduction to Q-Learning: Reinforcement Learning (Markov Decision Processes)
Modeling stochasticity: Markov Decision Processes
Consider that the robot is currently in the red room and needs to get to the green room.

Let’s now consider that the robot has a slight chance of malfunctioning: it might take the left, right, or bottom turn instead of the upper turn it needs in order to get to the green room from where it is now (the red room). The question is: how do we enable the robot to handle this when it is out there in the above environment?

This is a situation where the decision about which turn to take is partly random and partly under the robot’s control. Partly random because we are not sure when exactly the robot might malfunction, and partly under the robot’s control because it is still deciding which turn to take on its own, with the help of the program embedded in it. Here is the definition of a Markov Decision Process (taken from Wikipedia):
A Markov decision process (MDP) is a discrete time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
Focus on the second part of that definition — outcomes that are partly random and partly under the control of a decision maker. We have the exact same situation here in our case.
We have now introduced the concept of partly random, partly controlled decision making. Next, we need to give this concept a mathematical shape (most likely an equation) that we can take further. You might be surprised to learn that we can do this with the help of the Bellman Equation, with a few minor tweaks.
Here is the original Bellman Equation, again:
V(s) = max_a( R(s, a) + γV(s′) )
What needs to change in the above equation so that we can introduce some amount of randomness? Since we are not sure when the robot might fail to take the expected turn, we are also not sure which room it will end up in, i.e., which room it moves to from its current room. In terms of the above equation, this means we are no longer sure of s′, the next state (room, as we have been calling them). But we do know all the probable turns the robot might take! To incorporate each of these possibilities into the equation, we associate a probability with each turn, quantifying that the robot has an x% chance of taking that turn. If we do so, we get:
V(s) = max_a( R(s, a) + γ Σ_s′ P(s, a, s′)V(s′) )
Taking the new notations step by step:
- P(s, a, s′) - the probability of moving from room s to room s′ with action a
- Σ_s′ P(s, a, s′)V(s′) - the expectation over the possible next rooms s′, which accounts for the randomness the robot incurs
Notice that everything else in the equation stays exactly the same. Let’s assume the following probabilities are associated with each of the turns the robot might take while in the red room (in order to get to the green room).

When we associate probabilities with each of these turns, we essentially mean that there is, say, an 80% chance that the robot will take the upper turn. If we put all the required values into our equation, we get a Bellman Equation with these concrete probabilities plugged into the expectation term.
Note that the value footprints will now change because we are incorporating stochasticity here. But this time, we will not calculate those value footprints by hand. Instead, we will let the robot figure them out (more on this in a moment).
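To make the expectation term concrete, here is a minimal sketch, in Python with NumPy, of one stochastic Bellman backup. The number of rooms, the rewards, the transition probabilities, and the discount factor below are all assumptions chosen purely for illustration; they are not the environment from this article.

```python
import numpy as np

def bellman_backup(V, R, P, gamma=0.9):
    """One sweep of V(s) = max_a [ R(s, a) + gamma * sum_s' P(s, a, s') * V(s') ].

    V: (n_states,)                      current value of each room
    R: (n_states, n_actions)            reward for taking action a in room s
    P: (n_states, n_actions, n_states)  transition probabilities P(s, a, s')
    """
    expected_next = P @ V  # expectation over s', shape (n_states, n_actions)
    return np.max(R + gamma * expected_next, axis=1)

# Toy example with 2 rooms and 2 actions, where the intended move succeeds 80% of the time.
V = np.zeros(2)
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])               # assumed reward for each (room, action) pair
P = np.array([[[0.2, 0.8], [0.8, 0.2]],  # transition probabilities from room 0
              [[0.0, 1.0], [0.9, 0.1]]]) # transition probabilities from room 1

for _ in range(50):                      # repeat the backup until the values settle
    V = bellman_backup(V, R, P)
print(V)
```

The loop is exactly the "let the robot figure it out" part: repeatedly applying the backup makes the value of each room converge, even though any individual move is random.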
Up until this point, we have not considered rewarding the robot for the action of going into a particular room — we only reward it when it reaches the destination. Ideally, there should be a reward for every action the robot takes, to help it better assess the quality of its actions. The rewards need not always be the same, but having some reward for every action is much better than having no reward at all. When that per-step reward is a small negative value, the idea is known as the living penalty (a tiny sketch of such a per-step reward appears after the list of resources below). In reality, the rewarding system can be very complex, and modeling sparse rewards in particular is an active area of research in the domain of reinforcement learning. If you would like to give this topic a spin, the following resources might come in handy:
- Curiosity-driven Exploration by Self-supervised Prediction
- Curiosity and Procrastination in Reinforcement Learning
- Learning to Generalize from Sparse and Underspecified Rewards
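As a concrete illustration of the living penalty mentioned above, here is a tiny sketch of a reward function that charges a small cost for every step and only pays out at the destination. The room index and the numeric values are assumptions for illustration, not the article's setup.

```python
GOAL_ROOM = 3           # hypothetical index of the green room
GOAL_REWARD = 1.0       # payout for reaching the destination (assumed value)
LIVING_PENALTY = -0.04  # small cost charged on every other step (assumed value)

def reward(next_room: int) -> float:
    """Reward received after moving into next_room."""
    return GOAL_REWARD if next_room == GOAL_ROOM else LIVING_PENALTY
```

Because every wasted step now costs a little, the robot is nudged toward shorter routes to the green room instead of wandering around indefinitely.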
In the next section, we will introduce the notion of the quality of an action rather than looking at the value of going into a particular room (V(s)).