Ml Intro 2

4 minute read


Deep Q Learning

Recap is Q Learning is when the agent takes values based on the action. Well simple q-learning will not work in complex environments. So to better suit our agent to act we need to implement an artificial neural network which will then be providing Q values. In the maze example show in the first tutorial we will now be adding the x and y coordinates of our environment to our neural network and we will be working from there. Q learning and deep q learning both want to choose the best possible action, but deep q learning does it better.

What happens to Q learning : How does our agent learn

Here’s what was happening originally.

  1. Take an action, get a reward for taking the action in the state.
  2. Evaluate the value of the state he’s in, which is the maximum of the new q values (actions he can take)- that will be the new empirical q value (verified by logic) plus the reward.
  3. Ideally the new q value should equate to the prior q value he had in memory. Meaning that he got exactly what he had calculated for. ie. What you got minus what you were expecting. Should be ideally zero.
  4. The alpha learning rate will then adjust the q value. Comparison basis: The new q value with the old (what was predicted). So comparing prediction with what happens after.


// Temporal difference is about getting the q value

How we change.

  1. Through initial run, get q values using trial and error to populate the q table.
  2. Get position X1 and X2 of state. As a vector
  3. Feed them into the neural net for processing .
  4. Output the possible q values in the particular state.
  5. Compare the values to the ones that had been stored previously for all the actions.
  6. Loss is backpropagated. So that next time the weights are better suited in describing environment.

Comparison basis: Predicted action values compared with what was initially calculated in the previous step. So comparing prediction with what had happened before.

Deep Q Learning Acting : How our agent Decides

The q values are then passed through a softmax function (or any action selection policy) and selects the best action possible. Each time the game ends, thats another epoch. Each state will pass through the neural network.


Deep Q Learning Feature : Experience Replay

This is an essential feature for deep q learning. The neural network is triggered each time we run into a new state. But in real life scenarios, states might not change frequently. (Become monotonous like a self driving car driving in the desert, some sensors might not change.) This will invariantly affect our weights. Meaning the neural network will risk being biased to take the same action. It will become very good at only one thing? So how do we add this variety?

So as our agent gets experiences, it doesn’t learn there and then. It takes a uniformly distributed sample (batch) and learns from that experiences (passing though the network). So it also won’t lose unique/rare experiences. It gives you opportunity to learn faster from environments with a shortage of experiences.

Deep Q Learning Feature : Action Selection Policies

This is when the agent chooses the action to take. Previously we used the softmax function (above) but that is not the only one. Why do we have so many action selection policies ? Its about the balance between exploration and exploitation for our agent. So if our agent gets a negative\bad reward from our environment. It is forced to explore some more. If it gets a good reward it will then just exploit. The policies will now also assist in ensuring that the agent wont get stuck in a local maximum (there might be better rewards with other actions) There are three popular action selection policies

\begin{equation}\left (\epsilon\ greedy \right) \end{equation}Select action with the best q value. Except for epsilon % of the time (it will select at random)
\begin{equation}\left (\epsilon\ soft (1- \epsilon\ ) \right) \end{equation}Opposite of the greedy version. (Inverted)
SoftmaxMore advanced version.\begin{equation}\left (f_j(z) =\frac{e^{zj}}{\sum_k e^{zk} }) \right) \end{equation} Basically takes all the values and squashes them to be between 0 and 1. These will be probabilities and they will indicate the distribution of the actions we will be taking eg if 0.5 it means 50% of the times choose this value.