An introduction to Q-Learning: Reinforcement Learning (Bellman Equation)

Defining the problem statement

We are to build a few autonomous robots for a guitar-building factory. These robots will help the guitar luthiers by conveying to them the guitar parts they need in order to craft a guitar. These parts are located at nine different positions within the factory warehouse, and include polished wood sticks for the fretboard, polished wood for the guitar body, guitar pickups and so on. The luthiers have given topmost priority to the location that contains the body wood, and they have provided the priorities for the other locations as well, which we will look at in a moment. These locations within the factory warehouse look like -

A sample environment

As we can see, there are small obstacles present (represented by the smoothed lines) between the locations. L6 is the top-priority location, containing the polished wood for preparing guitar bodies. The task now is to enable the robots to find the shortest route from any given location to another on their own.
The agents, in this case, are the robots. The environment is the guitar factory warehouse.

The states

The states are the locations. The location a particular robot occupies at a particular instant of time denotes its state. Machines understand numbers rather than letters, so let's map the location codes to numbers.

Locations mapped as states
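To make this mapping concrete, here is a minimal sketch of it in Python (the dictionary names are just illustrative choices, not part of any particular library):

```python
# Map each warehouse location code to a numeric state index.
location_to_state = {
    'L1': 0, 'L2': 1, 'L3': 2,
    'L4': 3, 'L5': 4, 'L6': 5,
    'L7': 6, 'L8': 7, 'L9': 8,
}

# The reverse mapping is handy later, for turning computed states back into location codes.
state_to_location = {state: location for location, state in location_to_state.items()}
```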

The actions

In our example, the actions are the direct locations a robot can go to from a particular location. Consider a robot at location L8: the direct locations to which it can move are L5, L7 and L9. The figure below may come in handy in order to visualize this.

Sample actions


As you might have already guessed, the set of actions here is nothing but the set of all possible states of the robot. For each location, the set of actions a robot can take will be different. For example, the set of actions will change if the robot is in L1.
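As a quick sketch of this idea, the directly reachable locations can be written down explicitly. The adjacency below is an assumption read off the sample figure (only the L8 and L1 cases are spelled out in the text above), so adjust it to match your own layout:

```python
# Direct locations reachable from each location.
# This adjacency is assumed from the sample figure; adapt it to your own warehouse.
actions_from = {
    'L1': ['L2'],
    'L2': ['L1', 'L3', 'L5'],
    'L3': ['L2', 'L6'],
    'L4': ['L7'],
    'L5': ['L2', 'L8'],
    'L6': ['L3'],
    'L7': ['L4', 'L8'],
    'L8': ['L5', 'L7', 'L9'],
    'L9': ['L8'],
}

print(actions_from['L8'])   # ['L5', 'L7', 'L9'], as in the example above
print(actions_from['L1'])   # ['L2'] - a different set of actions
```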

The rewards

By now, we have the following two sets:
  • A set of states: S = {0, 1, 2, 3, 4, 5, 6, 7, 8}
  • A set of actions: A = {0, 1, 2, 3, 4, 5, 6, 7, 8}
Rewards will now be given to a robot if a location (read: state) is directly reachable from a particular location. Let's take an example:
L9 is directly reachable from L8. So, if a robot goes from L8 to L9 or vice-versa, it will receive a reward of 1. If a location is not directly reachable from a particular location, we do not give any reward (a reward of 0). Yes, the reward is just a number here and nothing else. It enables the robots to make sense of their movements, helping them decide which locations are directly reachable and which are not. With this cue, we can construct a reward table which contains all the reward values mapping between all the possible states (locations).


Table of rewards

In the above table, we have all the possible rewards that a robot can get by moving in between the different states.
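Using the location_to_state mapping and the actions_from adjacency sketched earlier, this table of rewards can be built as a NumPy matrix. This is a sketch under those assumptions; the 0/1 entries should of course mirror your own table:

```python
import numpy as np

# Build the table of rewards: 1 for every directly reachable location, 0 otherwise.
n_states = len(location_to_state)
rewards = np.zeros((n_states, n_states), dtype=int)

for location, reachable in actions_from.items():
    for neighbour in reachable:
        rewards[location_to_state[location], location_to_state[neighbour]] = 1

print(rewards)
```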
Now comes an interesting decision. Remember from the previous sections that the luthiers gave L6 the topmost priority? How do we incorporate this fact in the above table? This is done by associating the topmost-priority location with a much higher reward than the usual ones. Let's put 999 in the cell (L6, L6):

Table of rewards with a higher reward for the topmost location
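In code, this is a single update to the reward matrix sketched above:

```python
# Give the top-priority location a much higher reward in the cell (L6, L6).
l6 = location_to_state['L6']
rewards[l6, l6] = 999
```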

We have now formally defined all the vital components of the solution we are aiming for. We will shift gears a bit and study some of the fundamental concepts that prevail in the world of reinforcement learning, starting with the Bellman Equation.

The Bellman Equation

Consider the following square of rooms which is analogous to the actual environment from our original problem but without the barriers.

An empty environment

Now suppose a robot needs to go to the room marked in green, from its current position (A), using the specified direction.

Sample environment, agent and directions to proceed

How can we enable the robot to do this programmatically? One idea would be to introduce some kind of footprint which the robot would be able to follow, like the one below.

An environment with value footprints

Here, a constant value is specified in each of the rooms that lie along the robot's way if it follows the direction specified above. In this way, if it starts at location A, it will be able to scan through these constant values and move accordingly. But this would only work if the direction is fixed in advance and the robot always starts at location A. Now, consider the robot starts at the following location:

An environment with value footprint (contd.)

The robot now sees footprints in two different directions. It is, therefore, unable to decide which way to go in order to get to the destination (the green room). This happens primarily because the robot does not have a way to remember the directions to proceed, so our job now is to give the robot a memory. This is where the Bellman Equation comes into play:
V(s) = max_a [ R(s, a) + γ·V(s′) ]
where,
  • s = a particular state (room)
  • a = action (moving between the rooms)
  • s′ = state to which the robot goes from s
  • 𝜸 = discount factor (we will get to it in a moment)
  • R(s, a) = a reward function which takes a state s and action a and outputs a reward value
  • V(s) = value of being in a particular state (the footprint)
We consider all the possible actions and take the one that yields the maximum value.
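Translated more or less literally into code, one application of the equation might look like the sketch below; reachable_states, reward and values are placeholders for whatever representation you use, not functions from any library:

```python
def bellman_value(state, reachable_states, reward, values, gamma=0.9):
    """One Bellman backup: V(s) = max over a of [ R(s, a) + gamma * V(s') ].
    Here an action is simply the choice of the next state s'."""
    return max(reward(state, next_state) + gamma * values[next_state]
               for next_state in reachable_states(state))
```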
There is one constraint, however, regarding the value footprint: the room marked in yellow, just below the green room, will always have a value of 1 to denote that it is the room directly adjacent to the green room. This also ensures that a robot gets a reward when it goes from the yellow room to the green room. Let's now see how to make sense of the above equation here.
We will assume a discount factor of 0.9.

An environment with value footprint (contd.)

For this room (read: state), what will be V(s)? Let's plug the values into the equation directly:

V(s) = max_a [ R(s, a) + γ·V(s′) ]
V(s) = max_a (0 + 0.9 × 1) = 0.9
Here, the robot gets no reward for going to the state (room) marked in yellow, hence R(s, a) = 0. The robot knows the value of being in the yellow room, hence V(s′) = 1. Following this for the other states, we should get:

An environment with some value footprints computed from the Bellman equation

A couple of things to notice here:
  • The max() function helps the robot always choose the state that gives it the maximum value of being in that state.
  • The discount factor 𝜸 gives the robot a notion of how far it is from the destination. It is typically specified by the developer of the algorithm and instilled in the robot.
The other states can also be given their respective values in a similar way:

An environment with all the value footprints computed from the Bellman equation
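To make this concrete, here is a minimal sketch that computes such value footprints for a small square of rooms by repeatedly applying the Bellman equation. The grid size, the position of the green room and the reward scheme are illustrative assumptions, not the exact layout of the figures above:

```python
import numpy as np

n_rows, n_cols = 4, 4
goal = (0, 3)          # the "green room" (assumed position)
gamma = 0.9            # discount factor

values = np.zeros((n_rows, n_cols))

def neighbours(row, col):
    """Rooms directly reachable from (row, col): up, down, left and right."""
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        r, c = row + dr, col + dc
        if 0 <= r < n_rows and 0 <= c < n_cols:
            yield r, c

# Repeatedly apply V(s) = max_a [ R(s, a) + gamma * V(s') ] until the values settle.
for _ in range(100):
    new_values = np.zeros_like(values)
    for row in range(n_rows):
        for col in range(n_cols):
            if (row, col) == goal:
                continue  # the green room itself keeps a value of 0 in this sketch
            candidates = [
                (1 if (r, c) == goal else 0) + gamma * values[r, c]  # reward only for entering the goal
                for r, c in neighbours(row, col)
            ]
            new_values[row, col] = max(candidates)
    values = new_values

print(np.round(values, 3))
```

Running this reproduces the pattern of footprints described above: the rooms next to the green room get a value of 1, their neighbours 0.9, then 0.81, and so on.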

The robot can now proceed to the green room using these value footprints, even if it is dropped at any arbitrary room in the above square. If a robot lands in the highlighted (sky blue) room, it will still find two options to choose from, but eventually either path will be good enough for the robot to take because of the way the value footprints are now laid out.
Notice how the main task of reaching a destination from a particular source got broken down into similar subtasks. It turns out that there is a programming paradigm developed especially for solving problems with such repetitive subproblems: dynamic programming. It was invented by Richard Bellman in 1954, who also gave us the equation we just studied (hence the name, Bellman Equation). Note that this is one of the key equations in the world of reinforcement learning.
If we think realistically, our surroundings do not always work in the way we expect; there is always a bit of stochasticity involved. This applies to a robot as well. Sometimes the robot's inner machinery may get corrupted. Sometimes the robot may come across hindrances on its way which it did not know about beforehand. Sometimes, even if the robot knows that it needs to take the right turn, it will not. So, how do we introduce this stochasticity into our case? Enter Markov Decision Processes.
