I remember watching AlphaGo beat Lee Sedol at Go back in 2016 with my roommates and thinking that I was feeling what people must have felt when they watched the moon landing live. This is monumental.
Since then, progress has not slowed down. Back in April, OpenAI Five secured a decisive win against world champions OG in the game Dota 2, demonstrating a mastery of not just local fights but global strategy. The games were beautiful, and even though a lot more attention has been given to the second game, in which Five utterly destroyed the opponents, I think that the first game was possibly more interesting. In particular, it contained a fascinating moment in which Five’s win confidence skyrocketed after a seemingly inconsequential battle. Whilst the human commentators thought the game was even, Five reported a 95% confidence that it would win. It then proceeded to steamroll down the mid lane, ultimately winning.
The interesting thing about AlphaGo and OpenAI Five, as opposed to the game-playing AIs we’ve had in the past like Deep Blue, is that whereas Deep Blue was programmed to play chess, AlphaGo and OpenAI Five were only programmed to learn. The reason why their victories are monumental is not that they beat the humans (though that is of course impressive), but rather that they did so by actually learning the game as opposed to being told what to do. AlphaGo made moves that top Go players considered creative and original. Contrary to the trope that computers can only do what humans tell them to do, these systems are coming up with novel solutions and innovative strategies.
DeepMind, the company that developed AlphaGo, has been working on having agents learn to play real-time multiplayer games in 3D environments, more similar to Dota 2 than to Go. They have a recent paper out in Science describing their system and some of the conclusions drawn from it.
***The original paper can be found here.***
The paper does two main things. First, it describes how the learning algorithm works. Second, it explores how the agents “think” about the game during the learning process and while playing. I think this second bit is actually the more interesting part, so after briefly describing the learning algorithm I’ll focus on the analysis of the agents.
The game that the agents are learning to play is a modified version of Quake III Arena. The game is a version of capture the flag. Teams face off against each other in a small three-dimensional arena. There are two bases, one for each team. A team scores a point by finding the enemy flag (which spawns at the enemy’s base), picking it up, and bringing it back to their own base. Players can “tag” opponents on the enemy team by shooting them with a laser, which sends the opponent back to their own base after a short delay.
The only information that the agents have access to is the pixels on the screen and the game points scored. This is super interesting, as in most contexts we tend to give the AI more information about the game. For example, we might give the AI direct access to its health, or its position on the map as coordinates, or how far it is from the base, and so on. Humans don’t directly have access to this information; they have to read it off the screen. Even though this can seem trivial to us since, well, we just read it off the screen, we don’t often realize how much cognitive processing is actually needed to convert those pixels into information that we can effectively use. The idea here was to put the AI on a level playing field. Don’t give it all the information directly; it has to learn how to extract it from the visuals of the game.
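To make the contrast concrete, here is a minimal sketch (my own, not from the paper) of the difference between the kind of privileged observation we often hand to game-playing AIs and the pixels-plus-score observation these agents actually receive. All the field names are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PrivilegedObservation:
    # What we often hand to game-playing AIs: the game state, pre-digested.
    health: float
    position: tuple          # (x, y, z) coordinates on the map
    distance_to_base: float
    carrying_flag: bool

@dataclass
class PixelObservation:
    # What the CTF agents actually get: raw pixels and the score, nothing else.
    frame: np.ndarray        # e.g. an (H, W, 3) RGB image of the screen
    game_points: float       # points scored so far

# The agent has to learn to extract health, position, flag status, etc.
# from `frame` on its own, just as a human reads them off the screen.
```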
As I said, I’m not going to focus too much on the exact details of the learning algorithm. If you are curious I recommend reading the paper. However, I will highlight its main components.
There are two levels of optimization going on. On one level we have the agent learning about the game. The agent uses reinforcement learning to refine and update its probability distribution over actions. The details are complicated, but the general idea is this. At each moment the agent has access to certain information — remember from earlier, the pixels and the score. Also at each moment the agent has a probability distribution over actions it might perform, based on the information it has. For example, if there are three possible actions, say, RIGHT, LEFT, and SHOOT, then the agent might assign probability 0.1 to RIGHT, 0.7 to LEFT, and 0.2 to SHOOT. The agent then samples this distribution. You can think of it in this example as the agent drawing a ball from an urn. If RIGHT corresponds to red balls, LEFT to blue, and SHOOT to green, then the urn has 1 red ball, 7 blue, and 2 green. So even though most of the time the agent will choose a blue ball, sometimes it will choose green or red.
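As a toy illustration of the urn example (my own sketch, not DeepMind’s code), here is what sampling an action from that probability distribution looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

actions = ["RIGHT", "LEFT", "SHOOT"]
probs = [0.1, 0.7, 0.2]   # the agent's current policy for this game state

# Drawing a ball from the urn: usually LEFT, but sometimes RIGHT or SHOOT.
action = rng.choice(actions, p=probs)
print(action)
```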
The outcome of the agent learning is that it changes its probability distribution (really its conditional probabilities, since we want to incorporate the information it has at each moment) based on how well or poorly it is doing (this is the reinforcement part: good actions are reinforced and bad actions are punished). There are of course subtleties here that I am eliding (for example, neural networks are involved), so this is a very general sketch of the way the agent learns.
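Here is an equally stripped-down caricature of the reinforcement step (again my own toy sketch; the real system uses neural networks and a much more sophisticated policy-gradient update): the chosen action is nudged to become more likely after a good outcome and less likely after a bad one.

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])   # RIGHT, LEFT, SHOOT
chosen = 2                          # suppose the agent chose SHOOT...
reward = 1.0                        # ...and it worked out well
learning_rate = 0.1

# Nudge the chosen action's log-probability in the direction of the reward
# (a crude caricature of a policy-gradient step), then renormalize.
logits = np.log(probs)
logits[chosen] += learning_rate * reward
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(3))   # SHOOT is now slightly more likely
```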
The second level of optimization is at the population level. In many other cases of having an agent learn to play a game (AlphaGo, for example), the agent plays itself to learn. However, in this case, DeepMind actually had a population of agents, each learning by playing the others. Each agent had slightly different learning parameters (which make it learn slightly differently from the other agents), and thus a different way of playing the game. They then throw an evolutionary dynamic on top of this population, having unsuccessful agents die off and be replaced by mutated versions of more successful agents in the population. This is a super cool way of doing things: we are evolving successful learners.
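A rough sketch of what one step of that population-level loop might look like (hypothetical code, not DeepMind’s actual scheme, which as I understand it also evolves things like each agent’s internal rewards along with its hyperparameters):

```python
import random

def evolve(population, fitness, mutation_scale=0.2):
    """One step of the outer, evolutionary loop: the weakest agents are
    replaced by mutated copies of the strongest ones."""
    ranked = sorted(population, key=fitness)        # worst first, best last
    n_replace = max(1, len(ranked) // 5)            # bottom 20% die off
    for loser in ranked[:n_replace]:
        parent = random.choice(ranked[-n_replace:]) # copy one of the best
        loser["weights"] = parent["weights"]
        # ...and mutate its learning hyperparameters slightly, so the
        # population keeps exploring different ways of learning.
        loser["learning_rate"] = parent["learning_rate"] * random.uniform(
            1 - mutation_scale, 1 + mutation_scale
        )
    return population

# Example usage: population = [{"weights": ..., "learning_rate": 1e-4}, ...]
# where fitness(agent) returns something like that agent's recent win rate.
```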

[Figure from the paper illustrating the training process; the caption notes that an animation of the process can be found in movie S1.]
We can see this process in the figure above; the description is the authors’ own caption from the paper. I’d also highly recommend watching the videos they released, especially Movie S2, which shows the learning process.
How did the agents do? Very well (the FTW, “For The Win,” agents are the best of the artificial agents):
The FTW agents clearly exceeded the win-rate of humans in maps that neither agent nor human had seen previously—that is, zero-shot generalization—with a team of two humans on average capturing 16 fewer flags per game than a team of two FTW agents…. Only as part of a human-agent team did we observe a human winning over an agent-agent team (5% win probability). This result suggests that trained agents are capable of cooperating with never-seen-before teammates, such as humans. In a separate study, we probed the exploitability of the FTW agent by allowing a team of two professional games testers with full communication to play continuously against a fixed pair of FTW agents. Even after 12 hours of practice, the human game testers were only able to win 25% (6.3% draw rate) of games against the agent team (28).
p. 364
So we see that the agents do very well, even against a professional human team with full communication, a resource that the agents lack (they don’t get to communicate with each other). The authors go on to look at the effect of reaction time. Even when they slowed the agents down to human reaction times, the professional humans only won 30% of the time. Reaction time was not the main difference maker.
Now I’ll move on to the agent analysis portion — where they take a look at how the agents are “thinking” about the game at different points throughout their training process. This is a super key part of understanding these agents. Neural networks and other machine learning algorithms are notorious for being opaque to our understanding — we can see what decisions they make, but not why they make them. This is why I find their analysis super interesting — they did a great job of peering into the black box.
In order to do this, the researchers first extracted information about the state of the game from the game engine itself. They chose certain features, such as whether or not the agent was carrying a flag, or whether or not it sees a teammate on its screen. In their words, they “say that the agent has knowledge of a given feature if logistic regression on the internal state of the agent accurately models the feature. In this sense, the internal representation of the agent was found to encode a wide variety of knowledge about the game situation” (p. 364).
This is super cool. By picking a set of features and checking whether the agent’s internal state is correlated with them, they can figure out whether or not the agent is tracking certain features of the game. This is what I meant by peering into the black box: if I looked at the agent’s internal state directly it would be totally opaque to me, but by running this analysis we can get a better handle on how the agent thinks about the game. We can talk about the agent’s internal state having meaning.
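In code, the probing idea looks roughly like this (a minimal sketch with made-up data; the paper’s actual analysis is of course more careful): fit a logistic regression from the agent’s internal state to a binary game feature and see how well it predicts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend data: one 256-dimensional internal state per timestep, plus a
# ground-truth label read from the game engine ("am I carrying the flag?").
states = rng.normal(size=(5000, 256))
carrying_flag = rng.integers(0, 2, size=5000)

X_train, X_test, y_train, y_test = train_test_split(
    states, carrying_flag, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("decoding accuracy:", probe.score(X_test, y_test))
# ~0.5 here because the fake data is random; a high accuracy on real agent
# states would mean the agent "knows" whether it is carrying the flag.
```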
The researchers also tracked when agents started to pay attention to different features:
Looking at the acquisition of knowledge as training progresses, the agent first learned about its own base, then about the opponent’s base, and then about picking up the flag. Immediately useful flag knowledge was learned before knowledge related to tagging or their teammate’s situation. Agents were never explicitly trained to model this knowledge; thus, these results show the spontaneous emergence of these concepts purely through RL-based training.
p. 364
There are many philosophical positions and puzzles concerning abstraction — how we move from empirical observations or examples to more general concepts. Paying attention to how these agents do it might pay off philosophically.
The researchers also tried to get a better handle on not just what the agents were representing, but how:
We also found individual neurons whose activations coded directly for some of these features— for example, a neuron that was active if and only if the agent’s teammate was holding the flag, which is reminiscent of concept cells (56). This knowledge was acquired in a distributed manner early in training (after 45,000 games) but then represented by a single, highly discriminative neuron later in training (at around 200,000 games).
p. 364
This seems like it could be very interesting for its connections to neuroscience, and understanding how the human brain represents different concepts. The dynamics also seem like they’d be interesting — moving from a more distributed pattern early on to a single neuron later. I’m not particularly well-versed in this field, so I’ll leave it aside — I’d be happy for speculation in the comments, though.
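One simple way to go looking for such a neuron (my own sketch, not the method used in the paper) is to score each unit of the internal state individually, for example by how well its activation alone separates timesteps where the teammate holds the flag from timesteps where it doesn’t:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Fake internal states and a ground-truth binary feature, as before.
states = rng.normal(size=(5000, 256))
teammate_has_flag = rng.integers(0, 2, size=5000)

# Score each neuron by how discriminative its activation is on its own.
aucs = np.array([
    roc_auc_score(teammate_has_flag, states[:, i]) for i in range(states.shape[1])
])
best = int(np.argmax(np.abs(aucs - 0.5)))
print(f"most selective neuron: {best}, AUC = {aucs[best]:.3f}")
# A single neuron with AUC near 1.0 would be the kind of "concept cell"
# the authors describe emerging late in training.
```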
They also found that the agents had rich representations of the terrain. Remember, the agents play on randomly generated maps — so they are not just memorizing one map, but quickly learning to navigate a new one. This speaks to a certain flexibility the agent has in its abilities.
Finally, they also looked at the various strategies the agents would employ, and when they discovered (invented?) them in the learning process:
Analysis of temporally extended behaviors provided another view on the complexity of behavioral strategies learned by the agent (57) and is related to the problem a coach might face when analyzing behavior patterns in an opponent team (58). We developed an unsupervised method to automatically discover and quantitatively characterize temporally extended behavior patterns, inspired by models of mouse behavior (59), which groups short game-play sequences into behavioral clusters (fig. S9 and movie S3). The discovered behaviors included well-known tactics observed in human play, such as “waiting in the opponents base for a flag to reappear” (“opponent base camping”), which we only observed in FTW agents with a temporal hierarchy. Some behaviors, such as “following a flag-carrying teammate,” were discovered and discarded midway through training, whereas others such as “performing home base defense” are most prominent later in training.
p. 364
Just as before we were able to understand which parts of the game the agents were tracking, here we are able to understand how the agents think about the game strategically. We can also see how the agents’ thinking about the game changes during the learning process. If you think about your own experience playing a game this might sound familiar; you start off discovering simple strategies, but these are often not optimal. When you face superior opponents, you have to search for better strategies. Over time, you become a better player.
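To give a flavor of what this kind of behavior analysis involves, here is a hypothetical sketch (not the authors’ actual method, which they describe as inspired by models of mouse behavior): slice the game into short windows of per-timestep features, then cluster the windows and inspect what each cluster corresponds to.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Fake per-timestep game features (position, has_flag, sees_enemy, ...):
# 10,000 timesteps, 8 features each.
features = rng.normal(size=(10_000, 8))

# Slice the game into short, overlapping windows and flatten each one.
window = 30
windows = np.array([
    features[t:t + window].ravel() for t in range(0, len(features) - window, 10)
])

# Group similar windows; each cluster is a candidate "behavior pattern"
# (e.g. base camping, following the flag carrier) to inspect by hand.
clusters = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(windows)
print(np.bincount(clusters))
```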
The level at which these agents have learned to play is quite impressive. However, what is (perhaps?) equally or even more impressive is the work the folks at DeepMind have done to help us understand how our artificial agents are thinking about things. This kind of analysis will prove invaluable when we start having artificial agents do real work for us in the real world, outside of a game. We had better know what they are up to.