So, for now, I'm just going to assume that we're working in the two-player zero-sum setting, though this does extend in some cases to other situations as well. Our goal is to find an approximate Nash equilibrium, and we're going to measure performance in terms of exploitability. You can think of exploitability as distance from a Nash equilibrium: it's how much worse we would do against a worst-case adversary than if we had played a Nash equilibrium instead. So, how exploitable are we?
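To make this concrete, here is a minimal sketch of how exploitability can be measured in a two-player zero-sum game given in matrix form. The matrix-game setting and the rock-paper-scissors example are my stand-ins for illustration, not from the talk:

```python
import numpy as np

# Payoff matrix for the row player in a two-player zero-sum game.
# Rock-paper-scissors as a toy stand-in: its Nash equilibrium is uniform.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def exploitability(row_strategy, col_strategy, payoff):
    """Total amount a best-responding adversary gains against each player,
    relative to the game value. Zero exactly at a Nash equilibrium."""
    # Value a best-responding column player holds the row player to.
    br_vs_row = (row_strategy @ payoff).min()
    # Value a best-responding row player extracts from the column player.
    br_vs_col = (payoff @ col_strategy).max()
    # The game value cancels out of the sum of the two adversary gains.
    return br_vs_col - br_vs_row

uniform = np.ones(3) / 3
print(exploitability(uniform, uniform, A))                      # 0.0: equilibrium
print(exploitability(np.array([0.5, 0.25, 0.25]), uniform, A))  # 0.25: punishable
```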
I would argue that exploitability is actually extremely important and has been overlooked in the AI community as a whole. I think two recent man-machine matches really highlight this. One is the OpenAI one-versus-one Dota 2 matches that you might have heard about, and the other is Fan Hui versus AlphaGo. In the OpenAI matches, they made an AI that was able to beat top humans in one-versus-one Dota 2 over three games. But after it won against the top human, they opened it up to the public and invited random, mediocre players to play against it to see if they could find any weaknesses. In fact, pretty quickly, within a few thousand games, weak players found certain tricks that could basically fool the AI; they figured out how to exploit it and beat it. Also, in Fan Hui versus AlphaGo: AlphaGo famously beat Fan Hui 5-0, but afterwards they invited him to play several more matches against it to see if he could find any weaknesses in the AI. In fact, he was able to find weaknesses that let him consistently beat the AI, and they had to patch this before the Nature paper was released.

So, I think what this really demonstrates is that it's not enough to beat top humans in three or five or even ten games. You have to be able to consistently beat top humans, especially if you want to deploy an AI into the real world. If you're Microsoft and you're trying to deploy these products to real users, there are millions or billions of them, and if there's a weakness, they're going to find it. But with [inaudible], we played the top humans not just in three or five hands of poker; we played them in 120,000 hands of poker over the course of 20 days. That whole time, all four players were working as a team to try to exploit the AI in any way they could find.

In fact, I had lunch with one of the players just a couple of months ago. He said that the thing they found most shocking about the competition was that, at the end of each day, we gave them a log of all the hands that were played and told them what the bot had held on each of those hands. This is a big deal, because in poker a big part of the game is keeping your strategy hidden. If you fold, your opponent doesn't get to see what cards you had. In fact, even if you don't fold but lose the hand, you still don't have to show what your cards were. So, you only see your opponent's hand about 20 to 25 percent of the time; poker players will sometimes even call just to see what their opponent had. But here, we were simply giving them that information: we told them what the bot had on every single hand it played. So, they didn't have to worry about that part at all, and they found it absolutely amazing that they could not figure out how to exploit the AI, even though we were showing them the bot's hands every single time and the bot's strategy wasn't really changing much between days.

All right, so I think exploitability is extremely important, I think it has been overlooked by the AI community, and it's telling that the imperfect-information game-solving community has focused on it throughout its existence. Now, I want to get to an example of why imperfect-information games are hard. I'm going to talk about a simple example that I call Coin Toss. It starts with a coin flip: a coin is flipped and lands heads or tails with 50-50 probability. Player one observes the outcome of the coin toss; Player two does not.
So, after the coin lands, Player one has a choice: they can either sell the coin or choose play. If they choose sell, this leads to some separate subgame, the details of which are not important. The only thing that matters is the expected value. We'll say that if the coin landed heads, it was lucky and Player one can sell it for $0.50. On the other hand, if the coin landed tails, it was unlucky and Player one loses $0.50 by selling it. Alternatively, they can choose play, in which case Player two has to guess how the coin landed without having observed it. If Player two guesses correctly, that is, Player two guesses heads and the coin actually landed heads, then Player one loses one dollar and Player two gains one dollar. The payoffs are shown for Player one because this is a two-player zero-sum game, so Player two just receives the opposite payoff. On the other hand, if Player two guesses incorrectly, that is, they guess tails and the coin actually landed heads, then Player one gains one dollar and Player two loses one dollar. You can see there's a dotted line between Player two's two nodes; this signifies that Player two is in what's called an information set. Because they did not observe how the coin landed, they do not know which of those two states they are actually in.

So, now imagine that you are Player two in this game. You've just observed Player one choose the play action, so you know that you are in this imperfect-information subgame. What should you do? Should you guess heads or should you guess tails? One option is to just always guess heads. But that's obviously a really bad strategy, because now Player one can just sell the coin when it lands heads and get $0.50, and choose play when the coin lands tails and gain a dollar. So, on average, Player one is getting $0.75. On the other hand, you could always guess tails, but that's also a really bad idea, because now Player one can choose play when the coin lands heads and gain a dollar, and choose sell when the coin lands tails and lose $0.50, which is better than losing a dollar. So, on average, they're still getting $0.25 in this game.

So, it turns out that the optimal strategy is to mix: guess heads with 25 percent probability and tails with 75 percent probability. If you do that, then no matter what Player one does, the best they can do is break even, getting zero dollars on average. So, this is the Nash equilibrium strategy for Player two, at least for this subgame.

But now, let's say we change the game a little bit. Let's change the payoffs for the sell action, so that now, in expectation, Player one loses $0.50 by choosing sell when the coin lands heads, and gains $0.50 by choosing sell when the coin lands tails. Well, it's pretty easy to see that as Player two, your strategy in this subgame should now change as well: you should now be guessing heads with 75 percent probability and tails with 25 percent probability. What's happened here is that, by changing the expected value of the sell action, we have affected the optimal strategy in the play subgame.
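As a quick sanity check of those numbers, here is a small sketch in Python. The function name and structure are mine, not from the talk; it just evaluates Player one's best response to each possible mixing probability for Player two:

```python
# Player one's expected payoff in Coin Toss as a function of Player two's
# probability of guessing heads, for given sell payoffs in each state.

def p1_value(p_heads, sell_heads=0.5, sell_tails=-0.5):
    # If the coin is heads, play pays P1 -1 when P2 guesses heads, +1 when tails.
    play_if_heads = -1 * p_heads + 1 * (1 - p_heads)
    # If the coin is tails, play pays P1 +1 when P2 guesses heads, -1 when tails.
    play_if_tails = +1 * p_heads - 1 * (1 - p_heads)
    # P1 observes the coin and picks the better of sell/play; the coin is 50-50.
    return 0.5 * max(sell_heads, play_if_heads) + 0.5 * max(sell_tails, play_if_tails)

print(p1_value(1.0))    # 0.75: always guessing heads gives P1 $0.75 on average
print(p1_value(0.0))    # 0.25: always guessing tails still gives P1 $0.25
print(p1_value(0.25))   # 0.0:  the 25/75 mix holds P1 to break-even
print(p1_value(0.75, sell_heads=-0.5, sell_tails=0.5))  # 0.0: new payoffs, new mix
```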
Even though the sell action is not part of the play subgame, and in fact is not even on the path leading to the play subgame. This is something that happens in imperfect-information games and does not happen in perfect-information games. In perfect-information games, if you want to determine the optimal strategy in a subgame, you only need to look at that subgame by itself. But in imperfect-information games, you have to look at the game as a whole. So, you can think of perfect-information games as a special case where you don't have to worry about any of this, and imperfect-information games as the more general case where this is a problem.

So, what do we do? Well, it turns out that we don't actually have to know the strategy for the entire game as a whole. I mentioned that the sell action leads to a subgame where both players might take actions. But you don't have to worry about that: the only thing that really matters for determining the optimal strategy in the play subgame is the expected value of Player one choosing sell. So, what we can do is estimate what that value is to Player one, and if we have that, we can determine the optimal strategy in the play subgame. That's what we actually did in [inaudible]. We also have a theorem that says that if this estimate is within delta of the true Nash equilibrium value, then we can solve the play subgame and get within delta of the Nash equilibrium.

So, in [inaudible], we actually do this. We have this massive game which is simply way too large to solve upfront. So, we come up with a really good strategy just for the early part of the game, and we estimate what the optimal strategy is and what the expected values are in the later parts of the game. Then, when we're actually playing and find ourselves in a particular subgame, we come up with a much better strategy for that particular subgame, using information about the expected values from the other subgames. Then we repeat this process: we come up with a really good strategy for the early parts that are coming up and just estimate how to play in the later parts, and when we find ourselves in a new subgame, we again compute a much better strategy for that particular subgame using information about the expected values of the other subgames. That's called nested subgame solving. This was the key breakthrough that allowed us to beat top humans. So, when I, yes?

>> Just [inaudible].

>> Yes, that's a great question. So, this is how we would do it in general, but in poker there are four betting rounds, and we solved the first two with a pre-computed strategy. Each round grows exponentially in size, so the first two rounds are actually pretty small. When we got to the end of the second betting round, that's when we applied subgame solving and came up with a much better strategy for the remainder of the game. We abstracted the bets: rather than considering all 20,000 different bet sizes, we would consider just a small fraction of them. Then, each time the opponent acted, each time they made a bet, we would solve a new subgame for that bet size. So, we would apply this recursive solving every time the opponent made an action beyond the second betting round.
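Here is what that looks like on Coin Toss itself, as a sketch. The only input the play subgame needs from the rest of the game is the expected value of Player one's sell action in each state; the brute-force grid search is my simplification, not the actual solver used in the full system:

```python
# Safe subgame solving on the Coin Toss 'play' subgame: Player two picks a
# mixing probability that minimizes Player one's value, given that Player
# one can always fall back on the (estimated) value of selling instead.

def solve_play_subgame(sell_heads, sell_tails, grid=10001):
    best_p, best_val = None, float("inf")
    for i in range(grid):
        p = i / (grid - 1)                    # P2's probability of guessing heads
        v_heads = max(sell_heads, 1 - 2 * p)  # P1's best option when coin is heads
        v_tails = max(sell_tails, 2 * p - 1)  # P1's best option when coin is tails
        val = 0.5 * v_heads + 0.5 * v_tails   # P1's overall expected value
        if val < best_val:
            best_p, best_val = p, val
    return best_p, best_val

print(solve_play_subgame(0.5, -0.5))   # (0.25, 0.0): original sell values
print(solve_play_subgame(-0.5, 0.5))   # (0.75, 0.0): changed sell values flip the mix
```

If the sell estimates are off by at most delta, Player one's fallback values move by at most delta, which is roughly the intuition behind the delta guarantee mentioned above.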
So, when I mention this idea of what's called safe subgame solving, where we use the expected values from the other subgames, people always ask about what's called unsafe subgame solving, which is the more intuitive approach. The idea here is: why don't we just estimate what the opponent's strategy is? Let's say we've played a bunch of hands against them, or we can estimate what the Nash equilibrium is for them, and we figure out that, say, they should be choosing play 80 percent of the time when the coin lands heads and 30 percent of the time when the coin lands tails. Now, if we assume the opponent is playing this strategy, can we reason about the distribution of states we might be in and then solve optimally using that distribution? It turns out that doesn't work.

So, let me give you an example of what this would look like. When the coin lands either heads or tails, we reason that we're in one of these states with 50-50 probability. Now, if we observe Player one choose play, we would say: okay, in a Nash equilibrium, we would expect Player one to choose play 80 percent of the time if we were in the heads state and 30 percent of the time if we were in the tails state. So, we update our belief about which state we're in using Bayes' rule, and we reason that we're in the heads state with 73 percent probability and the tails state with 27 percent probability. Now, we would just say: if we assume this distribution is correct, then the optimal strategy is to always guess heads. But we've already established that's a really bad idea, because now the opponent can simply shift to selling the coin when it lands heads and choosing play when it lands tails. So, the problem with this approach is that we're making an assumption about how the opponent is playing.
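The Bayes update and the resulting bad best response are easy to reproduce; this short sketch just redoes the arithmetic from the example above:

```python
# Unsafe subgame solving on Coin Toss: assume a model of Player one,
# update beliefs with Bayes' rule, and best-respond to the posterior.

p_play = {"heads": 0.8, "tails": 0.3}   # assumed opponent model from the example
prior = {"heads": 0.5, "tails": 0.5}    # the coin flip itself

# Posterior over the coin's state given that we observed 'play'.
z = sum(prior[s] * p_play[s] for s in prior)
posterior = {s: prior[s] * p_play[s] / z for s in prior}
print(posterior)   # ~73% heads, ~27% tails

# Player two's expected payoff for each guess under that belief.
ev_heads = posterior["heads"] * 1 + posterior["tails"] * -1   # ~ +0.45
ev_tails = posterior["heads"] * -1 + posterior["tails"] * 1   # ~ -0.45
print(ev_heads, ev_tails)
# 'Always guess heads' looks optimal under the assumed model, yet Player one
# can switch to selling on heads and playing on tails, collecting $0.75 per
# flip against exactly that policy.
```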