Published:

Learn by doing - Reinforcement learning AI

What is machine learning ?

Machine Learning is a form of Artificial Intelligence in which the program is designed to learn on its own. So how does this machine learn on its own ? There are three types of how this learning can happen. Supervised, unsupervised and reinforcement learning.

Lets quickly get to the bottom and know who’s who in these three machine learning paradigms.

So what is learning? Lets imagine we have this equation

\begin{equation}\left (y = f(x) \right) \end{equation}

We get output (knowledge application) from applying a function to some input (maybe studying) We live in a world with input and output. The three learning paradigms now differ as to where the focus will be on.

TypeDescriptionNotes
SupervisedKnown output (training set). Given x lets predict y. (y=mx+b)Regression and Classification most common.
UnsupervisedGiven x simplify x. Find structure/pattern in the data. Ask questions to determine the answer.Clustering, Mean shift and auto encoding are the most common
ReinforcementNo x value. Reach y and get rewards.Labels can be classification or regression

Reinforcement Learning

Key concepts/ elements of reinforcement learning

The main goal of RL is to maximize the reward. The agent will be the software and it follows a policy (algorithm) and conduct an action. That action will create a reward and that reward is added to the cumulative reward sum (reward signal). So in essence

1. Agent starts exploring the environment
2. Environment then determines Policy. to create a policy that agent will start to follow action.
Element Of RLDescription
PolicyBasically defines the learning behavior. Maps state to action. 𝝿(s)→a, 𝝿*
Reward SignalThe reward of taking a particular action. R(s,a). Can be negative (Living Penalty)
Value FunctionThe accumulation of the rewards the agent will get. V(s)
ModelMimics the environment the agent will be placed in.

Okay I hear you - so how do we get the policy from the environment?

There are three methods

• Brute Force - Look up every possible policy over every possible state. Highly inefficient.
• Policy Gradient - Explore state space and uses neural networks. Environment wont have to have the Markov Property but its more complicated to set up.
• Value Function - Popular and robust. Environment must have the markov property. Several implementation but what we will use is the q learning value function. Q Learning can be implemented via TEMPORAL DIFFERENCE or SARSA.

Dynamic Programming

Well this in simple english is breaking down a complex/large programming problem. It uses memoization(caching) to help reduce computational complexity. The solutions to the smaller problems are cached and add on to the solution of the larger example.

An example would be factorial.

Problem : Solve for 5!

Solution Description:

5! = 5 times 4 times 3 times 2 times 1

Lets break it down to smaller pieces

5! = 5*4! and so on

def factorial(n):
if n == 1:
return 1
print (str(n) + " * factorial "+ str((n-1)))
return n*factorial(n-1)

factorial(5)
5 * factorial 4
4 * factorial 3
3 * factorial 2
2 * factorial 1

120

Q Learning step 1 : Set Table

This will be setting up a table with rows that are the state and columns will be the action like this

StateAction
S1A1
S2A2
S3A3
S4A4

Q Learning step 2 : Calculate Q Values

The q value will be the optimal state action value. The agent (during exploration will populate the Q Values) The Q values will be calculated via temporal difference or sarsa method.

Temporal Difference: Take a look first before populating the Q Table. More suitable for offline implementation. Sarsa : (State Action Reward State Action) Populate as you make actual moving. More suitable for online implementation.

Prototyping RL models

After making your RL model- you might want to test it out first. So how can you test it? You need an environment. And how do you get the environment ? You use platforms. The ones worthy of mention are

• Deepmind Lab
• Vizdoom
• Minecraft
• OpenAI Gym

1. Richard Bellman

He is the father of dynamic programming. Dynamic programming is the art of breaking down a big problem into a set of much much smaller problems and then solving them one by one recursively.

\begin{equation}\left (V(s) = max_\alpha (R(s,a)+ \gamma V(s’)) \right) \end{equation}

In English: The value of the current state (V(s)) is the highest possible value of the reward function after taking an action (R(s,a)) added to the discounted (γ) value of the next step (V(s’)).

Cumulative Reward =Reward +Expected Value Of the Future
\begin{equation}\left (V(s) = max_\alpha \right) \end{equation}\begin{equation}\left ( (R(s,a)+) \right) \end{equation}\begin{equation}\left ( \gamma V(s’)) \right) \end{equation}
1. So why do we call out on the max function ? - Because there are MANY actions we can take. So we assume we will take the action with the highest reward. We will choose the highest value(greedy)
2. Why do we discount? - Well that is because the value will have to gradually increase as we move closer to the goal and vis-versa.
3. Why do we need the equation - _Imagine we were in a maze like the one below. How will we know which way to go if we were randomly placed in any box.

Plugging in the formula to the maze.

This will be the result of the maze values after applying the equation recursively (Now you see where dynamic programming fits in right?) However this is just one part of it. Deterministic. What of the random scenarios? What if our agent takes whisky and the route they take is now not so definite (non-deterministic)? Enter ..

2. Markov Decision processes

This is to cater for those scenarios where decisions are partly random and partly under the control of the decision maker.It is a discrete time stochastic control process. It is an add-on to the Bellman decision process.

The Markov Property

This essentially says the past does not matter (Inspirational huh?) The conditional probability of the future is solely dependent on the current state only.

The equation : (Adding randomness) Modifying Bellman.

The area being modified in the Bellman equation is V(s’) which will have to cater for (if our agent can take 3 states only)

\begin{equation}\left (Vs_1 ‘ , Vs_2 ‘, Vs_3 ‘ \right) \end{equation}

We will then have to multiply the three states with their probabilities of getting into those states.

\begin{equation}\left ( 0.8Vs_1 ‘ … 0.1Vs_2 ‘… 0.1Vs_3 ‘ \right) \end{equation*} (This is not the gamma, just the probability)

Then the entire formula can then be rewritten as

\begin{equation}\left (V(s) = max_\alpha (R(s,a)+ \gamma \sum_{s’} (P(s,a,s’)V(s’)) \right) \end{equation}

Now lets compare the deterministic with the non deterministic (stochastic approach)

Approach:Cumulative Reward =Reward +Expected Value Of the Future
Deterministic:\begin{equation}\left (V(s) = max_\alpha \right) \end{equation}\begin{equation}\left ( (R(s,a)+) \right) \end{equation}\begin{equation}\left ( \gamma V(s’)) \right) \end{equation}
Stochastic:\begin{equation}\left (V(s) = max_\alpha \right) \end{equation}\begin{equation}\left ( (R(s,a)+) \right) \end{equation}\begin{equation}\left ( \gamma \sum_{s’} (P(s,a,s’)V(s’)) \right) \end{equation}

Q learning application

From State to Action

We will now further move to improve our function and instead of offering reward based on the state, we will now focus on reward based on the action. So where does the q fit in? Well we want to now move from knowing which state is more lucrative to which action is more lucrative. So we know that actions lead to states- therefore there must be some correlation between the two.

Why is it called q-learning? :Most assume that the q stands for quality learning.

Value of a state V(s) is the same as quality of an action Q(s,a)

Now we will move to modify our equation to become

EquationDescriptionNotes
\begin{equation}\left (V(s)) \right) \end{equation}\begin{equation}\left ( = max_\alpha(R(s,a)+ \gamma \sum_{s’} P(s,a,s’)V(s’) \right) \end{equation}Original one
Therefore \begin{equation}\left (Q(s,a)) \right) \end{equation}\begin{equation}\left ( = R(s,a)+ \gamma \sum_{s’} (P(s,a,s’)V(s’)) \right) \end{equation}We know that by just moving theres some reward (even a zero) so R(s,a) + the expected reward. Max removed. Replaced V(s) With Q(s,a)

Before we proceed ! Notice that

\begin{equation}\left (Q(s,a) = (R(s,a)+ \gamma \sum_{s’} P(s,a,s’)V(s’) \right) \end{equation}

Meaning that in “Original one” (Table above) we were taking the maximum value across all possible actions as a result of taking each of those actions. But in Q(s,a) we are defining what we will get by taking a certain action.

TLDR; The only difference between V(s) and Q(s,a) is the max. So we will take the V and replace with Q.

Meaning that the value of a state is the value of the max of all possible q values. This validates our formula. Lets proceed to get rid of the V entirely as we can see its recursive.

EquationDescriptionNotes
\begin{equation}\left (Q(s,a)) \right) \end{equation}\begin{equation}\left ( = R(s,a)+ \gamma \sum_{s’} (P(s,a,s’)max_\alpha Q(s’,a’)) \right) \end{equation}Replaced V(s’) with Q. No more V’s

We look at possible action and then decide which action to take. Instead of comparing the different states it can end up in, it now compares the different actions it can take. Remember that our policy must define the actions ultimately.

Temporal Difference (TD)

There is very high stochasticity in determining the value of a state. TD is basically how you make your estimates of future rewards so that the TD error gets smaller.

For now (for simplicity purposes) lets revert back to the deterministic Bellman equation. We can convert it to stochastic but we just want to keep it readable.

\begin{equation}\left (V(s) = max_\alpha (R(s,a)+ \gamma V(s’)) \right) \end{equation} Original Bellman equation.

Lets imagine our agent moving

\begin{equation}\left ( qValueBefore = Q(s,a)) \right) \end{equation} \begin{equation}\left ( qValueAfter = R(s,a)+\gamma max_\alpha’ Q(s’,a’)) \right) \end{equation}

The temporal difference will be defined as BEFORE - AFTER

\begin{equation}\left (TD(a,s) = R(s,a)+\gamma max_\alpha’ Q(s’,a’) - Q(s,a) \right) \end{equation}

Ideally BEFORE - AFTER should give us 0. Because ‘after’ is a formula to get the ‘before’. Q(s,a) was calculated before in one of the previous iterations made by our agent and the after was just calculated just now. Its called temporal difference because of the shift in time. Should we completely discard the old Q value since we have the new value? No, because the new value might have been the result of a random move made by our agent. We don’t want to completely change our q values. \begin{equation}\left (TD(a,s) = R(s,a)+\gamma max_\alpha’ Q(s’,a’) - Q_s(s,a) \right) \end{equation} \begin{equation}\left (Q_t(s,a) = Q_s(s,a)+\alpha TD_s(a,s) \right) \end{equation}

NOTE: Alpha is our learning rate Between 0 & 1. s= t-1.

In english : agent will take an action. Agent gets a reward and ends up in a state. Based on that he calculates like this, “What should have been the Q value of the move that I made?”. Then compares. If the environment is constantly changing, then you will have to continually calculate the temporal differences and the q value.