An introduction to Q-Learning: Reinforcement Learning (Bellman Equation)
Defining the problem statement
We are to build a few autonomous robots for a guitar-building factory. These robots will help the guitar luthiers by conveying to them the guitar parts they need in order to craft a guitar. These parts are located at nine different positions within the factory warehouse. Guitar parts include polished wood sticks for the fretboard, polished wood for the guitar body, guitar pickups and so on. The luthiers have given topmost priority to the location that contains the body woods, and they have provided priorities for the other locations as well, which we will look at in a moment. These locations within the factory warehouse look like this:

As we can see, there are small obstacles present (represented by the smoothed lines) in between the locations. L6 is the top-priority location, as it contains the polished wood for preparing guitar bodies. Now, the task is to enable the robots to find the shortest route from any given location to another location on their own.
The agents, in this case, are the robots. The environment is the guitar factory warehouse.
The states
The states are the locations. The location in which a particular robot is present at a particular instant of time denotes its state. Machines understand numbers rather than letters, so let's map the location codes to numbers.

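As a minimal sketch, this mapping can be expressed as a plain Python dictionary. The particular ordering L1 → 0, …, L9 → 8 is an assumption made here for illustration; any consistent numbering works:

```python
# Map each warehouse location code to an integer state index.
# The ordering L1 -> 0, ..., L9 -> 8 is assumed here for illustration.
location_to_state = {
    'L1': 0, 'L2': 1, 'L3': 2,
    'L4': 3, 'L5': 4, 'L6': 5,
    'L7': 6, 'L8': 7, 'L9': 8,
}

# Inverse mapping, handy for reporting routes back in location codes.
state_to_location = {state: loc for loc, state in location_to_state.items()}
```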
The actions
In our example, the actions will be the direct locations that a robot can move to from a particular location. Consider a robot at location L8: the direct locations to which it can move are L5, L7 and L9. The figure below may come in handy for visualizing this.

As you might have already guessed, the set of actions here is nothing but the set of all possible states of the robot. However, the subset of actions a robot can actually take differs from location to location. For example, the set of valid actions will change if the robot is in L1 rather than L8. A tiny sketch of this is shown below.
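Under the same assumed numbering as above, the full action set is just the set of state indices; which of them are actually usable in a given state gets filtered later using the reward table:

```python
# The complete action set mirrors the state indices: action i means
# "move to location i" under the assumed mapping L1 -> 0, ..., L9 -> 8.
actions = list(range(9))

# From L8 (index 7), for example, only the actions for L5, L7 and L9
# (indices 4, 6 and 8) will turn out to be valid.
```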
The rewards
By now, we have the following two sets:
- A set of states: the nine locations, encoded as the integers 0 through 8
- A set of actions: moves to any of those locations, drawn from the same set of integers 0 through 8
A robot will now be rewarded if a location (read: state) is directly reachable from its current location. Let's take an example:
L9 is directly reachable from L8. So, if a robot goes from L8 to L9 or vice versa, it receives a reward of 1. If a location is not directly reachable from a particular location, we give no reward (a reward of 0). Yes, the reward is just a number here and nothing else. It enables the robots to make sense of their movements, helping them decide which locations are directly reachable and which are not. With this cue, we can construct a reward table which contains all the reward values mapping between all the possible states (locations).

In the above table, we have all the possible rewards that a robot can get by moving in between the different states.
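A sketch of such a reward table as a NumPy matrix is shown below. Only the adjacencies explicitly mentioned so far (L8 ↔ L5, L7, L9) are filled in as an example; the remaining entries would come from the obstacle layout in the warehouse figure. The names `location_to_state`, `rewards` and `available_actions` are just illustrative choices:

```python
import numpy as np

# Assumed mapping from the earlier sketch: L1 -> 0, ..., L9 -> 8.
location_to_state = {f'L{i + 1}': i for i in range(9)}

# rewards[i, j] = 1 if location j is directly reachable from location i, else 0.
rewards = np.zeros((9, 9))

# Only the adjacencies stated in the text are filled in here as an example
# (L8 <-> L5, L7, L9); the remaining pairs come from the warehouse figure.
for a, b in [('L8', 'L5'), ('L8', 'L7'), ('L8', 'L9')]:
    i, j = location_to_state[a], location_to_state[b]
    rewards[i, j] = rewards[j, i] = 1  # reachability is symmetric here

def available_actions(state):
    """Return the indices of locations directly reachable from `state`."""
    return np.nonzero(rewards[state] > 0)[0]
```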
Now comes an interesting decision. Remember from the previous sections that the luthiers prioritized L6 as the topmost location? So, how do we incorporate this fact into the above table? We do it by associating the topmost-priority location with a much higher reward than the usual ones. Let's put 999 in the cell (L6, L6):

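Continuing the sketch above, the priority is just one extra cell in the matrix (again using the assumed index 5 for L6):

```python
# Continuing the sketch above: give the top-priority location L6
# (index 5 under the assumed mapping) a very high reward in its own cell.
rewards_with_priority = rewards.copy()
rewards_with_priority[5, 5] = 999
```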
We have now formally defined all the vital components of the solution we are aiming at for the problem discussed above. We will now shift gears a bit and study some of the fundamental concepts that prevail in the world of reinforcement learning, starting with the Bellman Equation.
The Bellman Equation
Consider the following square of rooms which is analogous to the actual environment from our original problem but without the barriers.

Now suppose a robot needs to go to the room marked in green from its current position (A), using the specified direction.

How can we enable the robot to do this programmatically? One idea would be to introduce some kind of footprint which the robot would be able to follow like below.

Here, a constant value is placed in each of the rooms that will come along the robot's way if it follows the direction specified above. In this way, if it starts at location A, it will be able to scan for this constant value and move accordingly. But this only works if the direction is fixed in advance and the robot always starts at location A. Now, consider the robot starting at the following location:

The robot now sees footprints in two different directions. It is, therefore, unable to decide which way to go in order to get to the destination (the green room). This happens primarily because the robot has no way to remember the directions to proceed. Our job now is to give the robot a memory. This is where the Bellman Equation comes into play:

V(s) = max_a [ R(s, a) + 𝜸 V(s′) ]

where,
- s = a particular state (room)
- a = action (moving between the rooms)
- s′ = state to which the robot goes from s
- 𝜸 = discount factor (we will get to it in a moment)
- R(s, a) = a reward function which takes a state s and action a and outputs a reward value
- V(s) = value of being in a particular state (the footprint)
We consider all the possible actions and take the one that yields the maximum value.
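In code, one such Bellman backup for a single state can be written as a short function. This is only a sketch: `R`, `V`, `next_state` and `GAMMA` are hypothetical stand-ins for the reward function, the value footprints, the deterministic transition and 𝜸 respectively:

```python
def bellman_backup(s, actions, R, next_state, V, GAMMA=0.9):
    """One application of the Bellman equation for a single state s:
    V(s) = max over actions a of [ R(s, a) + GAMMA * V(s') ]."""
    return max(R(s, a) + GAMMA * V[next_state(s, a)] for a in actions)
```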
There is, however, one constraint regarding the value footprint: the room marked in yellow, just below the green room, will always have a value of 1, to denote that it is one of the rooms directly adjacent to the green room. This also ensures that a robot gets a reward when it goes from the yellow room to the green room. Let's now see how to make sense of the above equation here.
We will assume a discount factor of 0.9.

For this room (read: state), what will be V(s)? Let's put the values into the equation directly:
Here, the robot will not get any reward for going to the state (room) marked in yellow, hence R(s, a) = 0. The robot already knows the value of being in the yellow room, hence V(s′) is 1. The equation therefore gives V(s) = max(0 + 0.9 × 1) = 0.9. Following this for the other states, we should get:

A couple of things to notice here:
- The max() function helps the robot to always choose the action that leads to the state with the maximum value.
- The discount factor 𝜸 notifies the robot about how far it is from the destination: the value shrinks by a factor of 𝜸 with every step away from the green room. It is typically specified by the developer of the algorithm and instilled in the robot.
The other states can also be given their respective values in a similar way:

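For completeness, here is a minimal value-iteration sketch that produces footprints like the ones above by repeatedly applying the Bellman update until the values stop changing. The 3×3 layout, the room numbering and the choice of room 2 as the green room are assumptions made purely for illustration, not the exact layout in the figures:

```python
# A minimal value-iteration sketch on a small deterministic grid of rooms.
# The 3x3 layout and the choice of room 2 as the green (goal) room are
# assumptions made purely for illustration.
GAMMA = 0.9   # discount factor, as assumed above
GOAL = 2      # index of the green room in this made-up layout

# neighbours[s] lists the rooms directly reachable from room s
# (rooms numbered 0..8 row by row, no barriers).
neighbours = {
    0: [1, 3],    1: [0, 2, 4],    2: [1, 5],
    3: [0, 4, 6], 4: [1, 3, 5, 7], 5: [2, 4, 8],
    6: [3, 7],    7: [4, 6, 8],    8: [5, 7],
}

def reward(s, s_next):
    # A reward of 1 only for stepping into the green room, 0 otherwise.
    return 1.0 if s_next == GOAL else 0.0

# Start all footprints at 0 and sweep the Bellman update until they settle.
V = {s: 0.0 for s in neighbours}
for _ in range(100):
    new_V = {
        s: 0.0 if s == GOAL
        else max(reward(s, s2) + GAMMA * V[s2] for s2 in neighbours[s])
        for s in neighbours
    }
    if max(abs(new_V[s] - V[s]) for s in V) < 1e-9:
        V = new_V
        break
    V = new_V

print(V)  # rooms next to the goal get 1.0, the next ring 0.9, then 0.81, 0.729
```

Under these assumptions the footprints decay as 1 for the rooms adjacent to the green room, then 0.9, 0.81 and 0.729 as the distance grows, which matches the pattern described above.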
The robot can now find its way to the green room using these value footprints, even if it is dropped at any arbitrary room in the above square. If a robot lands in the highlighted (sky blue) room, it will still find two options to choose from, but eventually either path will be good enough for the robot to take, because of the way the value footprints are now laid out.
Notice how the main task of reaching a destination from a particular source got broken down into similar subtasks. It turns out that there is a programming paradigm developed especially for solving problems with such repetitive subproblems: dynamic programming. It was introduced by Richard Bellman in 1954, who also formulated the equation we just studied (hence the name, Bellman Equation). Note that this is one of the key equations in the world of reinforcement learning.
If we think realistically, our surroundings do not always work the way we expect; there is always a bit of stochasticity involved. This applies to a robot as well. Sometimes the robot's inner machinery may get corrupted. Sometimes the robot may come across hindrances on its way that it did not know about beforehand. And sometimes, even if the robot knows that it needs to take a right turn, it may not. So, how do we introduce this stochasticity into our case? Enter Markov Decision Processes.