DeepMind AI Beats Professional StarCraft II Players

Google-owned artificial intelligence lab DeepMind announced on Thursday that its new "AlphaStar" AI had beaten two of the world's best StarCraft II players.

Dario Wunsch and Grzego rz Komincz, ranked 44th and 13th in the world respectively, both lost 5-0 by DeepMind's AlphaStar in two separate best-of-five matches.

The victories are being hailed as a major breakthrough by academics and AI industry watchers, cosnidering the complexity of StarCraft.

StarCraft is a complex real-time strategy (RTS) game and building an AI capable of beating the best human players has emerged as a "grand challenge" over the last few years. Made by Blizzard Entertainment, StarCraft II is a PC game with various playing modes. The 1v1 tournament is the most popular in esports, and this is the one DeepMind focused on.

“The history of AI has been marked by a number of significant benchmark victories in different games,” David Silver, DeepMind’s research co-lead, said after the matches. “And I hope — though there’s clearly work to do — that people in the future may look back at [today] and perhaps consider this as another step forward for what AI systems can do.”

There are several different ways to play the game, but in esports the most common is a 1v1 tournament played over five games. To start, a player must choose to play one of three different alien “races” - Zerg, Protoss or Terran, all of which have distinctive characteristics and abilities (although professional players tend to specialise in one race). Each player starts with a number of worker units, which gather basic resources to build more units and structures and create new technologies. These in turn allow a player to harvest other resources, build more sophisticated bases and structures, and develop new capabilities that can be used to outwit the opponent. To win, a player must carefully balance big-picture management of their economy - known as macro - along with low-level control of their individual units - known as micro.

The need to balance short and long-term goals and adapt to unexpected situations, poses a huge challenge for systems that have often tended to be brittle and inflexible. Mastering this problem requires breakthroughs in several AI research challenges including:

Game theory: StarCraft is a game where, just like rock-paper-scissors, there is no single best strategy. As such, an AI training process needs to continually explore and expand the frontiers of strategic knowledge.
Imperfect information: Unlike games like chess or Go where players see everything, crucial information is hidden from a StarCraft player and must be actively discovered by “scouting”.
Long term planning: Like many real-world problems cause-and-effect is not instantaneous. Games can also take anywhere up to one hour to complete, meaning actions taken early in the game may not pay off for a long time.
Real time: Unlike traditional board games where players alternate turns between subsequent moves, StarCraft players must perform actions continually as the game clock progresses.
Large action space: Hundreds of different units and buildings must be controlled at once, in real-time, resulting in a combinatorial space of possibilities. On top of this, actions are hierarchical and can be modified and augmented. Our parameterization of the game has an average of approximately 10 to the 26 legal actions at every time-step.

AlphaStar’s behaviour is generated by a deep neural network that receives input data from the raw game interface (a list of units and their properties), and outputs a sequence of instructions that constitute an action within the game.

AlphaStar also uses a novel multi-agent learning algorithm. The neural network was initially trained by supervised learning from anonymised human games released by Blizzard. This allowed AlphaStar to learn, by imitation, the basic micro and macro-strategies used by players on the StarCraft ladder. This initial agent defeated the built-in “Elite” level AI - around gold level for a human player - in 95% of games.

In order to train AlphaStar, DeepMibnd's team built a highly scalable distributed training setup using Google's v3 TPUs that supports a population of agents learning from many thousands of parallel instances of StarCraft II. The AlphaStar league was run for 14 days, using 16 TPUs for each agent. During training, each agent experienced up to 200 years of real-time StarCraft play. The final AlphaStar agent consists of the components of the Nash distribution of the league - in other words, the most effective mixture of strategies that have been discovered - that run on a single desktop GPU.

Professional StarCraft players such as TLO and MaNa are able to issue hundreds of actions per minute (APM) on average.

In its games against TLO and MaNa, AlphaStar had an average APM of around 280, significantly lower than the professional players, although its actions may be more precise. This lower APM is, in part, because AlphaStar starts its training using replays and thus mimics the way humans play the game. Additionally, AlphaStar reacts with a delay between observation and action of 350ms on average.

Additionally, and subsequent to the matches, DeepMind's team developed a second version of AlphaStar. Like human players, this version of AlphaStar chooses when and where to move the camera, its perception is restricted to on-screen information, and action locations are restricted to its viewable region.

Performance of AlphaStar using the raw interface and the camera interface, showing the newly trained camera agent rapidly catching up with and almost equalling the performance of the agent using the raw interface.

Deepmind trained two new agents, one using the raw interface and one that must learn to control the camera, against the AlphaStar league. Each agent was initially trained by supervised learning from human data followed by the reinforcement learning procedure outlined above. Deepmind says that the version of AlphaStar using the camera interface was almost as strong as the raw interface, exceeding 7000 MMR on our internal leaderboard. In an exhibition match, MaNa defeated a prototype version of AlphaStar using the camera interface, that was trained for just 7 days.

"We hope to evaluate a fully trained instance of the camera interface in the near future," Deepmind said.

These results suggest that AlphaStar’s success against MaNa and TLO was in fact due to superior macro and micro-strategic decision-making, rather than superior click-rate, faster reaction times, or the raw interface.

While StarCraft is just a game, albeit a complex one, we think that the techniques behind AlphaStar could be useful in solving other problems. For example, its neural network architecture is capable of modelling very long sequences of likely actions - with games often lasting up to an hour with tens of thousands of moves - based on imperfect information. Each frame of StarCraft is used as one step of input, with the neural network predicting the expected sequence of actions for the rest of the game after every frame. The fundamental problem of making complex predictions over very long sequences of data appears in many real world challenges, such as weather prediction, climate modelling, language understanding and more. "We’re very excited about the potential to make significant advances in these domains using learnings and developments from the AlphaStar project," Deepmind added.

Deepmind believes that some of their training methods may prove useful in the study of safe and robust AI. One of the great challenges in AI is the number of ways in which systems could go wrong, and StarCraft pros have previously found it easy to beat AI systems by finding inventive ways to provoke these mistakes. AlphaStar’s league-based training process finds the approaches that are most reliable and least likely to go wrong.