Reinforcement Learning

A Gentle Introduction

06 May, 2020

Let's play a game. Assume your goal is to earn a million dollars. There are certain steps that, if taken correctly, will get you to this goal. One of those steps is deciding where to go out for dinner. There are three options: X, Y and Z. Which one will you choose?

When you decide to go to a restaurant, it's
EITHER that you select one at random to explore
OR that you exploit your previous experience by going to a "tried and tested" restaurant.
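
To make the explore-or-exploit choice concrete, here is a minimal sketch of one common way to mix the two, often called epsilon-greedy. The function name, the `epsilon` parameter and the satisfaction scores are all made up for illustration; they are not part of the story above.

```python
import random

def choose_restaurant(expected_satisfaction, epsilon=0.1, rng=random):
    """Explore with probability epsilon, otherwise exploit past experience."""
    if rng.random() < epsilon:
        # Explore: pick any restaurant at random.
        return rng.choice(list(expected_satisfaction))
    # Exploit: pick the restaurant with the highest expected satisfaction so far.
    return max(expected_satisfaction, key=expected_satisfaction.get)

# Hypothetical scores remembered from earlier visits.
expected_satisfaction = {"X": 0.6, "Y": 0.2, "Z": 0.4}

# With epsilon=0.0 we never explore, so the "tried and tested" option wins.
print(choose_restaurant(expected_satisfaction, epsilon=0.0))
```

A small `epsilon` means you mostly return to your favourite place but still try something new once in a while.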

Let's assume you chose X. All actions have consequences, so going to restaurant X will have consequences too. Let's refer to these consequences as value or quality.

Recapping our story so far:
Before going out, you might be hungry, tired or both. That is why you decided to go out for dinner.
So, while hungry and tired (the current state), you took the action of going to a "randomly selected" or "best available" restaurant X. The value (quality) of this action is based on your past experience. Let's call it Expected Satisfaction.

Once you reached restaurant X, ordered food and started socializing, you made friends who can bring you business. Thus, going to restaurant X moved you one step closer to your goal. You feel exuberant. What will you do now?
I am sure you would like to come to this restaurant X again. Right?

After meeting business connections, the consequences (value, quality) of the action cannot remain the same. The quality will improve (new business coming in) or worsen (things happen; maybe the new friends were not that good). Let's call this Actual Satisfaction.

Now, add a twist to this game:
Sometimes you decide to go out for dinner but miss an important task, like a deadline to submit a project or an important client call.
Life is not all about having good food. Agreed? You are supposed to live with constraints and select actions that maximize benefits. A constraint is the environment's way of telling you that you just did a good job or a bad job. Constraints, or rewards, are the second piece of information you need to find the steps to become a millionaire. To make things more interesting, let's assume that you learn the constraints/rewards only after you take an action. Imagine going out with friends and finding out later that you failed a project or lost a client during that time. It feels worse, right?

The total increase or decrease in the benefit of going to the restaurant is:

TD = Constraint + (Actual Satisfaction - Expected Satisfaction)
In other words
TD = Reward for Action + (Quality of Best possible Action in Future State - Quality of Action in Current State)

This is the temporal difference (TD) between your satisfaction levels before and after.

All you need to do is keep updating the quality of your actions using TD and keep following the best-quality actions (with a little bit of exploration).

Quality of Action in Current State = Quality of Action in Current State + TD
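
Here is a small worked sketch of the TD formula and the update rule above, using made-up numbers for one dinner at restaurant X. The function names are hypothetical. The `gamma` (discount) and `alpha` (learning rate) parameters do not appear in the formulas above; they are common additions in practice, and leaving both at 1.0 reproduces the formulas exactly as written.

```python
def td_error(reward, best_future_quality, current_quality, gamma=1.0):
    """TD = Reward + (Quality of best future action - Quality of current action).

    gamma is an assumed discount on future quality; 1.0 matches the text.
    """
    return reward + (gamma * best_future_quality - current_quality)

def update_quality(current_quality, td, alpha=1.0):
    """Quality = Quality + TD, optionally scaled by an assumed learning rate alpha."""
    return current_quality + alpha * td

# Hypothetical numbers for one evening out.
expected_satisfaction = 0.5   # what you expected before going
actual_satisfaction = 0.9     # new business connections made it better
constraint = -0.2             # you missed a client call while you were out

td = td_error(constraint, actual_satisfaction, expected_satisfaction)
new_quality = update_quality(expected_satisfaction, td)
print(td, new_quality)
```

With these numbers the evening still comes out ahead: the missed call costs something, but the gain in satisfaction is larger, so the quality of "go to X when hungry and tired" goes up.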

Theoretically, one day you will win this game by finding the best steps to achieve your goal.

The theory is rooted in our understanding of how the brain learns. Our brain, acting as an Agent on our behalf, starts working the day we are born and keeps helping us seamlessly navigate the complexities of the Environment.
It keeps track of accumulated benefits over a long period of time to REINFORCE good behaviours and actions.
If you want to make a machine work like our brain, then after every interaction calculate the TD and update the quality of the action, and voila! Your machine will be able to learn complex problems through trial and error.
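
Putting the pieces together, a machine that learns by trial and error could look roughly like the sketch below. Everything in it is an assumption made for illustration: the tiny `WORLD` of states, actions and rewards, the function names, and the `alpha`, `gamma` and `epsilon` settings. It is one simple way to run the "act, calculate TD, update quality" loop, not the only one.

```python
import random

# Hypothetical toy world: (state, action) -> (reward, next_state).
# A next_state of None means the evening is over.
WORLD = {
    ("hungry", "X"): (0.0, "at_X"),
    ("hungry", "Y"): (0.2, None),
    ("hungry", "Z"): (-0.5, None),      # you missed a deadline
    ("at_X", "socialize"): (1.0, None),  # new business connections
    ("at_X", "eat_alone"): (0.1, None),
}

def actions(state):
    """All actions available in a state."""
    return [a for (s, a) in WORLD if s == state]

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    quality = {key: 0.0 for key in WORLD}  # quality of every (state, action)
    for _ in range(episodes):
        state = "hungry"
        while state is not None:
            choices = actions(state)
            if rng.random() < epsilon:
                action = rng.choice(choices)  # explore
            else:
                action = max(choices, key=lambda a: quality[(state, a)])  # exploit
            reward, next_state = WORLD[(state, action)]
            # Quality of the best possible action in the future state (0 if none).
            best_future = 0.0
            if next_state is not None:
                best_future = max(quality[(next_state, a)] for a in actions(next_state))
            td = reward + (gamma * best_future - quality[(state, action)])
            quality[(state, action)] += alpha * td
            state = next_state
    return quality

quality = train()
best = max(actions("hungry"), key=lambda a: quality[("hungry", a)])
print("Best choice when hungry:", best)
```

Notice that going to X pays nothing immediately in this toy world; its quality grows only because the TD update passes back the value of what can happen once you are there.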

This is the intuition behind Reinforcement Learning.

To help you learn Reinforcement Learning by doing, I am sharing an interactive tutorial. Please try it and share your feedback.

Interactive Tutorial

Tutorial Videos



Dr. Uzair Ahmad

Machine Learning practitioner, technology inventor, author and educator.