Nested Software

Posted on • Edited on • Originally published at nestedsoftware.com

Tic-Tac-Toe with MCTS

So far in this series, we've implemented tic-tac-toe with minimax and tabular Q-learning. In this article we'll use another common technique, MCTS, or Monte Carlo tree search.

Monte Carlo simulations are used in many different areas of computing. Monte Carlo is a heuristic. With a heuristic, we are not guaranteed precisely the correct or the best answer, but we can get an approximation that can often be good enough. Some problems that can't be solved analytically, or (in a reasonable amount of time) by exhausting all possibilities, become tractable if we run a bunch of simulations instead.
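As a quick aside to illustrate the Monte Carlo idea in general (this example is mine, not part of the tic-tac-toe project), here is a classic sketch that approximates π by random sampling; more samples tend to give a better estimate, but the answer is never guaranteed to be exact:

import random

def estimate_pi(num_samples=1_000_000):
    # Sample random points in the unit square and count how many land
    # inside the quarter circle of radius 1.
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / num_samples

print(estimate_pi())  # typically prints something close to 3.14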

With MCTS, the simulations we perform are called playouts. A playout is a simulation of a single game, often from beginning to end, but sometimes from a given starting point to a given end point. When we get the result of the game, we update the statistics for each game position, or node, that was visited during that playout. The idea is to run a large enough number of playouts to statistically assess the value of taking a given action from a given state. The diagram below shows a part of a single playout for tic-tac-toe:

[Figure: part of a single MCTS playout for tic-tac-toe]

In the diagram, we choose a move for each position until we reach the end of the game (we'll look at how to make this choice in the next section). In this case, X wins, so we increment the win count and the number of visits for the final position. We increment the win count because X's previous move that led to this position produced a win for X. For the previous position, we increment the loss count instead of the win count. That's because O's move that led to that position ended up as a win for X: for that particular playout, it was a losing move. We go through the move history of the playout, alternating between updating the win and loss count (and the visit count) for each node that was visited. If O wins, then we do the same thing, incrementing the win count for the final position, and then we alternate as before. If it's a draw, then we just increment the draw count and visit count for each position in the playout. There is a similarity to the way minimax works here, and in fact, MCTS approximates minimax as we increase the number of simulations.
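As a rough sketch of this update rule (using hypothetical node objects with wins, losses, draws, and visits attributes, not the project's actual classes), walking back through the playout's move history might look like this:

def update_playout_statistics(visited_nodes, winner):
    # visited_nodes is ordered from the first position of the playout to
    # the final one; winner is 'X', 'O', or None for a draw.
    # The node objects here are hypothetical stand-ins.
    credit_win = True  # the final position is credited to the player who just moved
    for node in reversed(visited_nodes):
        node.visits += 1
        if winner is None:
            node.draws += 1
        elif credit_win:
            node.wins += 1
        else:
            node.losses += 1
        credit_win = not credit_win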

Selection and Upper Confidence Bound

How do we select the moves for a given playout? It turns out there are a lot of possible approaches. A simple way to produce a playout would be to choose random moves at each step until the game ends, then to update the statistics as described above.
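For example, a purely random playout could be sketched like this, where is_game_over, get_legal_moves, and play_move are hypothetical stand-ins for whatever board API is available, not functions from this project:

import random

def random_playout(board):
    # Play random moves from the given position until the game ends,
    # then return the final board so its result can be recorded.
    # is_game_over, get_legal_moves, and play_move are placeholders.
    while not is_game_over(board):
        move = random.choice(get_legal_moves(board))
        board = play_move(board, move)
    return board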

In this article, I've chosen to implement UCB-1 (upper confidence bound), a technique that has seen some significant success in machine learning in recent years. This technique of applying an upper confidence bound to MCTS is also sometimes referred to as UCT (upper confidence trees). UCT is an interesting idea used to efficiently select moves during a playout. As with epsilon-greedy (discussed in the previous article), the goal is to find a suitable balance between exploration (trying a move just to see what happens) and exploitation (visiting the node that already has a high value from previous simulations).

To calculate the upper confidence bound, we need the following information:

  • Nᵢ: The number of visits to the parent node (the position from which we're selecting a move)
  • nᵢ: The number of visits to a given child node (the position resulting from choosing a move)
  • wᵢ: The number of wins for a given child node
  • c: A constant that can be adjusted; the default value is the square root of 2, or √2

Given a position to start from, we apply the following formula for the position resulting from each possible move, then we choose the move with the highest value:

wᵢ/nᵢ + c · √(ln(Nᵢ) / nᵢ)
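For example, if a child node has been visited nᵢ = 10 times with wᵢ = 6 wins, and its parent has been visited Nᵢ = 100 times, then with c = √2 the bound works out to 6/10 + √2 · √(ln(100)/10) ≈ 0.6 + 0.96 ≈ 1.56 (numbers chosen purely for illustration).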

The upper confidence bound has two terms. The first is the winrate for a given node. For tic-tac-toe, I use the sum of wins and draws, since we know that a draw is the best possible result if neither player makes a mistake:

class Node:
    # other code omitted...

    def value(self):
        if self.visits == 0:
            return 0
        success_percentage = (self.wins + self.draws) / self.visits
        return success_percentage

The second term is the exploration term. This term increases for a given node when it hasn't been visited very much relative to the number of visits to the parent node. In the numerator, we've got the natural log of the number of visits to the parent node. The denominator is the number of visits to the current child node. If we don't visit a given child node, then the increase in the number of visits to the parent node will gradually increase the exploration term over time. Given enough time, the exploration term will get high enough for that child node to be selected:

[Figure: the ratio ln(Nᵢ)/nᵢ with the numerator, ln(Nᵢ), increasing]

If we keep increasing the number of visits to the parent node without visiting a given child node, we can see that the overall exploration term above will gradually increase. However, because it is scaled by the natural log, this increase is slow relative to the number of visits.

Each time we do visit a child node, this increments the denominator, which entails a decrease in the exploration term:

[Figure: the ratio ln(Nᵢ)/nᵢ with the denominator, nᵢ, increasing]

Since the denominator, unlike the numerator, is not scaled down by a logarithm, each visit to a child node that doesn't improve its winrate decreases the value of exploring that choice quite rapidly. Overall, if exploring a node has not been promising in the past, it may take a long time before that node is selected again, but it will happen eventually, assuming we run enough playouts.
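To get a feel for these two effects, here is a small self-contained snippet (not taken from the project) that prints the exploration term, c · √(ln(Nᵢ)/nᵢ), first as the parent's visit count grows while the child stays at one visit, and then as the child's visit count grows with the parent fixed:

import math

def exploration_term(parent_visits, child_visits, c=math.sqrt(2)):
    return c * math.sqrt(math.log(parent_visits) / child_visits)

# Parent visits grow, child stuck at 1 visit: the term creeps up slowly.
for parent in (10, 100, 1_000, 10_000):
    print(parent, round(exploration_term(parent, 1), 3))

# Parent fixed at 10,000 visits, child visited more often: the term drops quickly.
for child in (1, 5, 25, 125):
    print(child, round(exploration_term(10_000, child), 3))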

This approach to the problem of exploration vs exploitation was derived from Hoeffding's inequality. For more details, see the paper Using Confidence Bounds for Exploitation-Exploration Trade-offs, by Peter Auer.

Implementation

My implementation of MCTS for tic-tac-toe reuses the BoardCache class from the previous articles in this series. This object stores symmetrically equivalent board positions under a single key. In order to be able to take advantage of this caching, I had to make some small adjustments to my MCTS implementation. In particular, several distinct positions can lead to the same child position, for instance:

[Figure: multiple distinct parent positions leading to the same child position]

To handle this scenario, for a given child node, I keep track of all of its distinct parent nodes: I use the sum of the visits to the parent nodes to compute Nᵢ. Below is the core logic that the MCTS tic-tac-toe player uses to choose a move during a playout:

def calculate_value(node_cache, parent_board, board):
    node = find_or_create_node(node_cache, board)
    node.add_parent_node(node_cache, parent_board)
    if node.visits == 0:
        return math.inf
    parent_node_visits = node.get_total_visits_for_parent_nodes()
    exploration_term = (math.sqrt(2.0)
                        * math.sqrt(math.log(parent_node_visits) / node.visits))
    value = node.value() + exploration_term
    return value

If a given child node has not been visited for a playout yet, we make its value infinite. That way, from a given parent node, we give the highest priority to checking every possible child node at least once. Otherwise, we add its current success percentage to the exploration term. Note that once we've performed our playouts, we select the actual move to play in a game by just using the highest success rate from the available options.
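As an illustration of how this value could drive move selection during a playout, here is a rough sketch; get_legal_moves and play_move are hypothetical stand-ins for the project's board helpers, not its actual function names:

def choose_playout_move(node_cache, board):
    # Pick the move whose resulting position has the highest upper
    # confidence bound, as computed by calculate_value above.
    # get_legal_moves and play_move are hypothetical placeholders.
    moves = get_legal_moves(board)
    return max(moves, key=lambda move: calculate_value(node_cache, board,
                                                       play_move(board, move)))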

Results

From what I understand, UCT should converge to produce the same results as minimax as the number of playouts increases. For tic-tac-toe, I found that pre-training with 4,000 playouts produced results that were close to minimax, and in which the MCTS agent didn't lose any games (based on 1,000 games played):

[Table: results for the MCTS agent against various opponents]

In practice, I've found that MCTS is often used in an "online" mode. That is, instead of pre-training ahead of time, MCTS is implemented live. For example, the original version of AlphaGo used deep learning for move selection and position evaluation, but it also performed MCTS playouts before each move of a real (i.e. non-training) game. It used a combination of the neural network outputs and the results from playouts to decide which move to make. For tic-tac-toe, I tried doing online playouts after each move (without pre-training) and obtained good results (comparable to the table above) with 200 simulations per move.
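A minimal sketch of that online flow might look like the following, where run_playout and choose_best_move_by_success_rate are hypothetical placeholders for the corresponding parts of the real implementation:

def choose_move_online(node_cache, board, num_playouts=200):
    # Run a batch of playouts from the current position, then pick the
    # move whose resulting position has the highest success rate.
    # run_playout and choose_best_move_by_success_rate are placeholders.
    for _ in range(num_playouts):
        run_playout(node_cache, board)
    return choose_best_move_by_success_rate(node_cache, board)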

Code

The code for this project is available on github (mcts.py):

nestedsoftware / tictac

Experimenting with different techniques for playing tic-tac-toe

Demo project for different approaches for playing tic-tac-toe.

Code requires python 3, numpy, and pytest. For the neural network/dqn implementation (qneural.py), pytorch is required.

Create virtual environment using pipenv:

  • pipenv --site-packages

Install using pipenv:

  • pipenv shell
  • pipenv install --dev

Set PYTHONPATH to the main project directory:

  • In Windows, run path.bat
  • In bash, run source path.sh

Run tests and demo:

  • Run tests: pytest
  • Run demo: python -m tictac.main
  • Run neural net demo: python -m tictac.main_qneural

Latest results:

C:\Dev\python\tictac>python -m tictac.main

Playing random vs random
-------------------------
x wins: 60.10%
o wins: 28.90%
draw  : 11.00%

Playing minimax not random vs minimax random:
---------------------------------------------
x wins: 0.00%
o wins: 0.00%
draw  : 100.00%

Playing minimax random vs minimax not random:
---------------------------------------------
x wins: 0.00%
o wins: 0.00%
draw  : 100.00%

Playing minimax not random vs minimax not random:
-------------------------------------------------
x wins: 0.00%
o wins: 0.00%
draw  : 100.00%

Playing minimax random vs minimax random:

