When it comes to competitive games, AI systems have already shown they can easily mop the floor with the best humanity has to offer. But life in the real world isn't a zero-sum game like poker or StarCraft, and we need AI to work with us, not against us. That's why a research team from Facebook taught an AI how to play the cooperative card game Hanabi (the Japanese word for fireworks), to gain a better understanding of how humans think.
Specifically, the Facebook team set out to instill a theory of mind in its AI system. "Theory of mind is this idea of understanding the beliefs and intentions of other agents or other players or humans," Noam Brown, a researcher at Facebook AI, told Engadget. "It's something that humans developed from a very early age. But one AIs have struggled with for a very long time."
"It's trying to put itself in the shoes of the other players and ask why are they taking these actions," Brown continued, "and being able to infer something about the state of the world that it can't directly observe."
So what better way to teach an AI to play nice and empathize with other players than through a game that's basically cooperative group solitaire? Created by French game designer Antoine Bauza in 2010, Hanabi tasks its two to five players with building five five-card stacks. Each stack is color coded (like solitaire's suits) and must be ordered numerically from one to five. The goal is to complete all the stacks, or get as close to 25 points (five stacks at five points each) as possible before the team runs out of moves. The wrinkle to Hanabi is that no player knows what's in their own hand. Players have to hold their cards facing away from themselves, so while they don't know what they hold, their teammates do, and vice versa.
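To make the stacking and scoring rules concrete, here is a minimal Python sketch. The color names, the `play` and `score` helpers, and the dictionary layout are illustrative assumptions, not anything taken from Facebook's code.

```python
# Illustrative sketch of Hanabi's stacking and scoring rules.
# The color names and helper functions are assumptions for this example.
COLORS = ("red", "yellow", "green", "blue", "white")


def play(stacks, color, rank):
    """Add the card to its stack if it is the next rank in sequence."""
    if rank == stacks[color] + 1:
        stacks[color] = rank
        return True
    return False  # out of order: the card does not extend the stack


def score(stacks):
    """One point per card successfully stacked; 5 stacks x 5 cards = 25."""
    return sum(stacks.values())


stacks = {color: 0 for color in COLORS}
results = [
    play(stacks, "red", 1),   # starts the red stack
    play(stacks, "red", 3),   # rejected: red 2 has not been played yet
    play(stacks, "red", 2),   # extends the red stack
    play(stacks, "blue", 1),  # starts the blue stack
]
current_score = score(stacks)
```

A finished game would have every stack at five, for the perfect score of 25.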
Players can share information with their teammates by telling them either the color or number of cards in their hands. That information is limited to either "you have X number of blue cards" or "you have X number of 2 cards" while pointing to the specific cards. Furthermore, sharing information comes at the cost of one "information token." The number of these tokens is limited, which prevents the team from spending them all at the start of the game to fully inform themselves of what everybody is holding. Instead, players have to infer what they're holding based on what their teammates tell them and why they think their teammates chose to tell them that at that point in the game. Basically, it forces players to get into the headspace of their teammates and try to figure out the reasoning behind their actions.
To date, the AI systems that have bested human players in Go and Dota 2 have relied on reinforcement learning techniques to teach themselves how to play the game. Facebook's team improved upon this approach by incorporating a new real-time search function, similar to the one used by Pluribus when it curb-stomped five Texas Hold 'Em pros in June.
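The hint mechanic can be sketched in a few lines of Python. The `Card` class, the `give_hint` function and the starting count of eight tokens used below are assumptions made for illustration; the point is only that a hint names one color or one number, touches every matching card, and costs a token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    color: str
    rank: int


def give_hint(hand, tokens, *, color=None, rank=None):
    """Return (positions of the matching cards, tokens left after the hint)."""
    if tokens <= 0:
        raise ValueError("no information tokens left")
    if (color is None) == (rank is None):
        raise ValueError("a hint names exactly one color or one rank")
    if color is not None:
        positions = [i for i, card in enumerate(hand) if card.color == color]
    else:
        positions = [i for i, card in enumerate(hand) if card.rank == rank]
    return positions, tokens - 1


# A teammate's hand, visible to everyone except its owner.
hand = [Card("blue", 2), Card("red", 2), Card("blue", 5)]
# The starting token count of 8 is an assumption for this example.
blue_positions, tokens = give_hint(hand, 8, color="blue")
two_positions, tokens = give_hint(hand, tokens, rank=2)
```

After the two hints, the owner knows the first card is a blue 2, but the team has spent two of its tokens to get there.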
"This search method works in conjunction with a precomputed strategy, allowing the bot to fine-tune its actions for each situation it encounters during gameplay," Facebook's Hengyuan Hu and Jakob Foerster wrote in a blog post. "Our search technique can be used to significantly improve any Hanabi strategy, including deep reinforcement learning (RL) algorithms that set the previous state of the art."
The "precomputed strategy" is known as the blueprint policy. It's the generally accepted strategy and conventions that all the players agree to ahead of time. In Hanabi, those conventions are basically "don't lie to the others about what they're holding" and "don't intentionally tank the game."
"The way humans play is they start with a rough strategy, which is kind of what we call a blueprint here," Facebook AI researcher Adam Lerer told Engadget. "And then they search locally based on the situation they're in, to find optimal set of moves assuming that the other players are going to be playing this blueprint."
Facebook's Hanabi AI does the same thing. It starts from the precomputed blueprint and then searches in real time for a near-optimal move based on what cards are currently in play. What's more, the system can designate either a single player or multiple players as the "searcher." A searcher in this case is a player capable of interpreting the moves of their teammates, all of whom are assumed to operate under the blueprint policy.
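As a rough illustration of that kind of search, the Python sketch below picks the move with the best expected final score, averaged over the hands the searcher might be holding. The `choose_action` function, the two-hand belief tables and the score table standing in for "play the rest of the game under the blueprint" are all hypothetical simplifications, not Facebook's implementation.

```python
def choose_action(beliefs, legal_actions, rollout_value):
    """Pick the action with the highest expected score over possible hands.

    beliefs: dict mapping each possible hand to its probability.
    rollout_value(hand, action): estimated final score if the searcher takes
    `action` while holding `hand` and everyone then follows the blueprint.
    """
    def expected_score(action):
        return sum(p * rollout_value(hand, action) for hand, p in beliefs.items())

    return max(legal_actions, key=expected_score)


# Toy numbers, invented for illustration: playing a red 1 pays off,
# playing a blue 3 is a misplay, and discarding is the safe middle ground.
TOY_SCORES = {
    ("red 1", "play"): 25.0,
    ("blue 3", "play"): 20.0,
    ("red 1", "discard"): 23.0,
    ("blue 3", "discard"): 23.0,
}


def toy_rollout(hand, action):
    return TOY_SCORES[(hand, action)]


unsure = {"red 1": 0.2, "blue 3": 0.8}
confident = {"red 1": 0.9, "blue 3": 0.1}
cautious_move = choose_action(unsure, ["play", "discard"], toy_rollout)
bold_move = choose_action(confident, ["play", "discard"], toy_rollout)
```

With the uncertain beliefs the searcher discards, and with the confident ones it plays, which is the sense in which the search fine-tunes the move to the situation.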
In a "single-agent search," the searcher maintains a probability distribution as to what cards it thinks it's holding and then updates that distribution, "based on whether the other agent would have taken the observed action according to the blueprint strategy if the searcher were holding that hand," according to the blog post.
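That update is essentially Bayes' rule, and a minimal Python sketch might look like the following. The `update_beliefs` function, the two candidate hands and the toy blueprint probabilities are assumptions for illustration only.

```python
def update_beliefs(beliefs, observed_action, blueprint):
    """Reweight each possible hand by how likely the blueprint makes the action.

    beliefs: dict mapping each hand the searcher might hold to its probability.
    blueprint(hand): dict mapping a teammate's actions to their probabilities,
    given that the searcher holds `hand`.
    """
    weighted = {
        hand: prior * blueprint(hand).get(observed_action, 0.0)
        for hand, prior in beliefs.items()
    }
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("observed action is impossible under the blueprint")
    return {hand: weight / total for hand, weight in weighted.items()}


# Toy blueprint, invented for this example: a teammate who sees a playable
# red 1 in the searcher's hand almost always hints at it.
def toy_blueprint(hand):
    if hand == "red 1":
        return {"hint": 0.9, "discard": 0.1}
    return {"hint": 0.2, "discard": 0.8}


prior = {"red 1": 0.5, "blue 3": 0.5}
posterior = update_beliefs(prior, "discard", toy_blueprint)
```

Here the teammate discarded instead of hinting, so the searcher concludes it is probably not holding the red 1: the posterior shifts from 50/50 to roughly 11 percent versus 89 percent.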
"Multi-agent" search is a far more generalized and complicated function, more than we need to cover today, but it essentially enables each player to replicate the search the previous player ran to see what strategies their searchers came up with. While single-agent search provides enough of a predictive boost to put AI players ahead of even elite human Hanabi players, multi-agent search results in near-perfect 25-point scores.
The current state-of-the-art RL algorithm, SAD, averages 24.08 points in two-player Hanabi. Strapping a single-agent search function atop the RL system results in an average score of 24.21, higher than any solo RL system designed to date. Multi-agent search pushed that score to 24.61.
"We've also found that single-agent search greatly boosts performance in Hanabi with more than two players as well for every blueprint we tested," the blog post noted, "though conducting multi-agent search would be much more expensive with more players since each player observes more cards."
Getting near-perfect scores on an obscure French card game is great and all, but Facebook has bigger plans for its cooperative AI. "What we're looking at is artificial agents that can reason better about cooperative interactions with humans and chatbots that can reason about why the person they're chatting with said the thing they did," Lerer explained. "Chatbots that can reason better about why people say the things they do without having to enumerate every detail of what they're asking for is a very straightforward application of this type of search technique."
The team also points toward potential applications in autonomous vehicles. A self-driving car, for example, could infer that the cars ahead are slowing and stopping because a pedestrian is crossing the road, without having to see the person in the crosswalk first-hand. More immediately, however, the team hopes to extend its research to mixed cooperative-competitive games like Bridge.