Ml Intro 4

2 minute read


A3C (Asynchronous Advantage Actor-Critic)

It is a cutting edge AI algorithm. It is far better than deep convolutional q learning and requires less training but at the same time producing better results. It has multiple modifications but its simply the best. It was developed by Google’s DeepMind. This algorithm was first mentioned in 2016 in a research paper appropriately named Asynchronous Methods for Deep Learning.

So how do we move from deep convolutional q learning to A3C? Well, lets look at the different parts for this algorithm.

Actor Critic

The output will have a second output. First will be the Q values (Actor) Second will be Value, V(s). The value of the state will be the critic. So we are no longer predicting the q value on but the critic. Q values can also be called policy.



It means that there will be multiple agents going through the environment. This reduces the risk of local maximum. So how will the agents share the information, they will share the same critic. In order to optimize the environment.


Modification Added

The creator of pytorch then modified the above setup for the A3C so that all agents share the same neural network. This will mean that they will share the weights.



In our A3C setup now, we will have 2 types of losses. Value Loss and the policy loss (remember that we now have two outputs from the neural network (The actor and the critic). The value loss similar to the losses we were having before.

What is the policy loss? Lets introduce the concept of advantage

\begin{equation}\left (A = Q(s,a) - V(s) \right) \end{equation}

So the Q value of the action you decided to take minus the value of the state. So why do we calculate the Advantage. The q value is coming from what the neural net predicted whereas the V value is the one dictated by the critic. The critic tells us what the known value of the state is. So basically it wants to compare how much better the Q value is compared to the V Value. This is the advantage the action over the known state. So it will also be back propagated. Back-propagation now happens to maximize the advantage.

A3C Addon: Long short-term memory

A LSTM is an extra layer that will add memory to your algorithm. It removes need for an extra hidden layer afterwards. This layer will produce the actor and critic.


It remembers the previous values. The constant value and the previous output on a high level.

Why do we need memory? So we might need to remember so that we could observe like knowing if ball is headed towards you.