Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork4
fast + parallel AlphaZero in PyTorch
License
lowrollr/turbozero_torch
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
📣 Check out the new and improvedturbozero
written in JAXhere 📣
The TurboZero project contains vectorized, hardware-accelerated implementations of AlphaZero-esque algorithms, alongside vectorized implementations of single-player and multi-player environments. Basic training infrastructure is also included, which means models can be trained for supported environments straight out of the box. This project is similar to DeepMind'smctx, but as of now is more focused on model-based algorithms like AlphaZero rather than model-free implementations such as MuZero, and is written with PyTorch instead of JAX. Due to this focus, TurboZero includes additional features relavant to model-based algorithms, such as persisting MCTS subtrees. I hope to eventually expand this project and implemented hardware-accelerated adaptations of other RL algorithms, likeMuZero andStochastic AlphaZero.
This project has been a labor of love but is still a little rough around the edges. I've done my best to fully explain all configuration options in this file as well as in thewiki. Thewiki also provides notes on implementation and vectorization for each of the environments as well as Monte Carlo Tree Search. While as of writing this I believe the project is in a usable, useful state, I still intend to do a great deal of work expanding functionaltiy, fixing issues, and improving performance. I cannot garauntee that data models or workflows will not drastically change as the project matures.
Training reinforcement learning algorithms is notoriously compute-intensive. Oftentimes models must train for millions of episodes to reach desired performance, with each episode containing many steps and each step requiring numerous model inference calls and dynamic game-tree exploration. All of these factors contribute to RL training tasks sometimes being prohibitvely expensive, even when taking advantage of process (CPU) parallelism. However, if environments and algorithms can be implemented as a set of multi-dimensional matrix operations, this computation can be offloaded to GPUs, reaping all the benefits of GPU parallelism by training on and evaluating stacked environments in parallel. TurboZero includes implementations of simulation environments and RL algorithms that do just that.
While other common open-source implementations of AlphaZero complete training runs in days/weeks, TurboZero can complete similar tasks in minutes/hours when paired with the appropriate hardware.
Vectorized environments are available across a variety of projects at this point. TurboZero's main contribution, therefore, is its vectorized implementaiton of MCTS that supports subtree persistence, which is integrated into a feature-rich RL training pipeline with minimal effort. One direction I'd like to go in the future is integrating with 3rd-party vectorized environments, as I believe this would dramatically increase TurboZero's usefulness.
TurboZero provides vectorized implementations of the following environments:
Environment | Type | Observation Size | Policy Size | Description |
---|---|---|---|---|
Othello | Multi-Player | 2x8x8 | 65 | 2-player tile-swapping game played on an 8x8 board. also called Reversi |
2048 | Single-Player | 4x4 | 4 | Single-player numeric puzzle game |
Each environment supports the full suite of training and evaluation tools, and are implemented with GPU-acceleration in mind. Links to the environment readmes are found above, which provide information on configuration options, implementation details, and results acheived.
TurboZero supports training policy/value models via the following vectorized algorithms:
Name | Description | Hyperparameters | Paper |
---|---|---|---|
AlphaZero | DeepMind's algorithm that first famously defeated Lee Sodol in Go and has since been shown to generalize well to other games such as Chess and Shogi as well as more sophisticated tasks like code generation and video compression. | hyperparameters | Silver et al., 2017 |
LazyZero | A lazy implementation of AlphaZero that only utilizes PUCT to dictate exploration at the root node. Exploration steps instead use fixed depth rollouts sampling from the trained model policy. I wrote this as a simpler, albeit worse alternative to AlphaZero, and showed it can effectively train models to play2048 and win. | hyperparameters |
Training can be done in a Jupyter notebook, or via the command-line. In addition to environment parameters and training hyperparameters, the user may specify the number of environments to train in parallel, so that the user is able to optimize for their own hardware. SeeQuickstart for a quick guide on how to get started, orTraining for full information on configurating your training run. I also provide example configurations that I have used to train effective models for each environment.
In addition to the algorithms supporting training a policy, TurboZero also provides vectorized implementations of the following algorithms that serve as baselines to evaluate against:
Name | Description | Parameters |
---|---|---|
Greedy MCTS | MCTS using a heurisitc function to evaluate leaf nodes | parameters |
Greedy | Evaluates potential actions using a heuristic function, no tree search | parameters |
Random | Makes a random legal move | parameters |
Evaluating against these algorithms can be baked into the evaluation step of a training run, or be run independently. SeeEvaluation & Testing for the full configuration specification.
Available for multi-player environments, tournaments provide a great way to gauge the relative strength of an algorithm in relation to various opponents. This allows the user to evaluate the effectiveness of adjusting parameters of an algorithm, or analyze how effective increasing the size of a neural network is in terms of performance. In addition, tournaments allow algorithms to be compared against a large cohort of baseline algorithms. Where applicable, I provide tournament data for each environment that will allow you to test your algorithms and models against a pre-populated field.
For more about tournaments, and configuration options, see theTournaments wiki page.
Demo mode provides the option to step through a game alongside an algorithm, which can be useful as a debugging tool or simply interesting to watch. For multi-player games, demo mode allows you to playagainst an algorithm, whether it be a heuristic baseline or a trained policy. For more information, see theDemo page.
I've included aHello World Google Colab notebook that runs through all of the main features of TurboZero and lets the user train and play against their ownOthello AlphaZero model in only a few hours:
If you'd rather run TurboZero on your own machine, follow the setup instructions below.
The following commands will install poetry (dependency managaement), clone the repository, install required packages, and create a kernel for notebooks to connect to.
curl -sSL https://install.python-poetry.org | python3 - && export PATH="/root/.local/bin:$PATH" && git clone https://github.com/lowrollr/turbozero.git && cd turbozero && poetry install && poetry run python -m ipykernel install --user --name turbozero
This will allow you to have access to the proper dependencies in Jupyter Notebooks by connecting to theturbozero
kernel.
You can run scripts on the command-line by creating a shell using
poetry shell
If you'd rather not use poetry's shell, you can instead prependpoetry run
to any commands.
To get started training a simple model, you can use one of the following commands, which load example configurations I've included for demonstration purposes. These commands will train a model and run periodic evaluation steps to track progress.
python turbozero.py --verbose --mode=train --config=./example_configs/othello_tiny.yaml --logfile=./othello_tiny.log
python turbozero.py --verbose --mode=train --config=./example_configs/2048_tiny.yaml --logfile=./2048_tiny.log
The configuration files I've included train very small models and do not run many environments in parallel. You should be able to run this on your personal machine, but these commands will not train performant models.
If you have access to a GPU with CUDA, you can use the following commands to train slightly larger models.
python turbozero.py --verbose --gpu --mode=train --config=./example_configs/othello_mini.yaml --logfile=./othello_mini.log
python turbozero.py --verbose --gpu --mode=train --config=./example_configs/2048_mini.yaml --logfile=./2048_mini.log
With proper hardware these should not take long to train, as they are still relatively small. These commands will train on 4096 environments in parallel as opposed to 32 for the CPU configuration.
For more information on training configuration, please see theTraining wiki page.
If you'd like to evaluate an existing model, you can use--mode=test
, link a checkpoint file with--checkpoint
. For example:
python turbozero.py --verbose --mode=test --config=./example_config/my_test_config.yaml --checkpoint=./checkpoints/my_checkpoint.pt --logfile=./test.log
For more information on evaluation/testing coniguration, see theEvaluation & Testing wiki page.
To run an example tournament with some heuristic algorithms, you can run the following command:
python turbozero.py --mode=tournament --config=./example_configs/othello_tournament.yaml
Remember to use the --gpu flag here if you have one, all algorithms are hardware accelerated!
For more information on tournament coniguration, see theTournaments wiki page.
python turbozero.py --mode=demo --config=./example_configs/othello_demo.yaml
For more information on demo coniguration, see theDemo wiki page!
If you use this project and encounter an issue, error, or undesired behavior, please submit aGitHub Issue and I will do my best to resolve it as soon as I can. You may also contact me directly viahello@jacob.land
.
Contributions, improvements, and fixes are more than welcome! I've written a lot in theWiki, I hope it provides enough information to get started. For now I don't have a formal process for this, other than creating aPull Request.
If you found this work useful, please cite it with:
@software{Marshall_TurboZero_Vectorized_AlphaZero, author = {Marshall, Jacob}, title = {{TurboZero: Vectorized AlphaZero, MCTS, and Environments}}, url = {https://github.com/lowrollr/turbozero}}
About
fast + parallel AlphaZero in PyTorch
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Sponsor this project
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.