fast + parallel AlphaZero in PyTorch

lowrollr/turbozero_torch

[Images: "alphazero turns the tide!", "2048"]

📣 Check out the new and improved turbozero written in JAX here 📣

🏁 TurboZero

The TurboZero project contains vectorized, hardware-accelerated implementations of AlphaZero-esque algorithms, alongside vectorized implementations of single-player and multi-player environments. Basic training infrastructure is also included, which means models can be trained for supported environments straight out of the box. This project is similar to DeepMind's mctx, but as of now is more focused on algorithms that search with a known environment model, like AlphaZero, rather than algorithms that learn their own model, such as MuZero, and is written with PyTorch instead of JAX. Due to this focus, TurboZero includes additional features relevant to search with a known model, such as persisting MCTS subtrees. I hope to eventually expand this project and implement hardware-accelerated adaptations of other RL algorithms, like MuZero and Stochastic AlphaZero.

This project has been a labor of love but is still a little rough around the edges. I've done my best to fully explain all configuration options in this file as well as in the wiki. The wiki also provides notes on implementation and vectorization for each of the environments as well as Monte Carlo Tree Search. While I believe the project is in a usable, useful state as of this writing, I still intend to do a great deal of work expanding functionality, fixing issues, and improving performance. I cannot guarantee that data models or workflows will not drastically change as the project matures.

Motivation

Training reinforcement learning algorithms is notoriously compute-intensive. Models often must train for millions of episodes to reach the desired performance, with each episode containing many steps and each step requiring numerous model inference calls and dynamic game-tree exploration. All of these factors can make RL training tasks prohibitively expensive, even when taking advantage of process (CPU) parallelism. However, if environments and algorithms can be implemented as a set of multi-dimensional matrix operations, this computation can be offloaded to GPUs, reaping all the benefits of GPU parallelism by training on and evaluating stacked environments in parallel. TurboZero includes implementations of simulation environments and RL algorithms that do just that.
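As a rough illustration of what "a set of multi-dimensional matrix operations" can look like, the sketch below checks, for a whole batch of 2048 boards at once, whether sliding left is a legal move. It was written for this README and is not taken from TurboZero's source; the function name `legal_left_moves` and the `(N, 4, 4)` board layout with 0 marking an empty cell are assumptions.

```python
import torch


def legal_left_moves(boards: torch.Tensor) -> torch.Tensor:
    """Return a bool tensor of shape (N,): True where sliding left changes the board.

    boards: integer tensor of shape (N, 4, 4); 0 marks an empty cell (assumed layout).
    """
    left = boards[:, :, :-1]   # every cell that has a right-hand neighbour
    right = boards[:, :, 1:]   # that right-hand neighbour
    occupied = right != 0
    can_slide = (left == 0) & occupied      # a tile has an empty cell directly to its left
    can_merge = (left == right) & occupied  # two equal tiles sit side by side
    # One reduction over the 12 horizontal pairs of each board; no Python loop over boards.
    return (can_slide | can_merge).flatten(start_dim=1).any(dim=1)


boards = torch.tensor([
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],        # empty board
    [[2, 2, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],        # a merge is possible
    [[2, 4, 8, 16], [4, 8, 16, 32], [2, 4, 8, 16], [4, 8, 16, 32]],  # full, no equal neighbours in any row
])
mask = legal_left_moves(boards)
print(mask.tolist())  # [False, True, False]
```

The same function works unchanged for 3 boards or 4096, and on a GPU if `boards` lives there, which is the property the paragraph above relies on.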

While other common open-source implementations of AlphaZero complete training runs in days/weeks, TurboZero can complete similar tasks in minutes/hours when paired with the appropriate hardware.

Vectorized environments are available across a variety of projects at this point. TurboZero's main contribution, therefore, is its vectorized implementation of MCTS that supports subtree persistence, which is integrated into a feature-rich RL training pipeline with minimal effort. One direction I'd like to go in the future is integrating with third-party vectorized environments, as I believe this would dramatically increase TurboZero's usefulness.
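For readers unfamiliar with subtree persistence, the idea is that after a move is played, the part of the search tree below that move is kept and reused as the starting point for the next search instead of being rebuilt from scratch. The sketch below shows the idea for a single, pointer-based tree. It is not TurboZero's vectorized implementation, and the `Node` fields and the `advance_root` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """One search-tree node; field names are illustrative only."""
    visits: int = 0
    value_sum: float = 0.0
    children: Dict[int, "Node"] = field(default_factory=dict)


def advance_root(root: Node, action: int) -> Node:
    """Re-root the tree at the child reached by `action`.

    The child's whole subtree (visit counts, values, grandchildren) is kept,
    so the next search starts with those statistics; the siblings are dropped
    together with the old root. If the action was never expanded, a fresh
    empty node is returned instead.
    """
    return root.children.get(action, Node())


root = Node(visits=10)
root.children[3] = Node(visits=7, value_sum=4.2, children={1: Node(visits=5)})
root.children[5] = Node(visits=2)

new_root = advance_root(root, 3)
print(new_root.visits, sorted(new_root.children))  # 7 [1]
```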

Features

Environments

TurboZero provides vectorized implementations of the following environments:

| Environment | Type | Observation Size | Policy Size | Description |
| --- | --- | --- | --- | --- |
| Othello | Multi-Player | 2x8x8 | 65 | 2-player tile-swapping game played on an 8x8 board. Also called Reversi |
| 2048 | Single-Player | 4x4 | 4 | Single-player numeric puzzle game |

Each environment supports the full suite of training and evaluation tools and is implemented with GPU acceleration in mind. Links to the environment readmes are found above; they provide information on configuration options, implementation details, and results achieved.

Training

TurboZero supports training policy/value models via the following vectorized algorithms:

| Name | Description | Hyperparameters | Paper |
| --- | --- | --- | --- |
| AlphaZero | DeepMind's successor to AlphaGo, the system that famously defeated Lee Sedol in Go. It has since been shown to generalize well to other games such as Chess and Shogi, as well as more sophisticated tasks like code generation and video compression. | hyperparameters | Silver et al., 2017 |
| LazyZero | A lazy implementation of AlphaZero that only utilizes PUCT to dictate exploration at the root node. Exploration steps instead use fixed-depth rollouts sampling from the trained model policy. I wrote this as a simpler, albeit worse, alternative to AlphaZero, and showed it can effectively train models to play 2048 and win. | hyperparameters | |
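Both algorithms in the table rely on PUCT to decide which action to explore. As a minimal sketch, the selection score described in the AlphaGo Zero and AlphaZero papers adds an exploration bonus, weighted by the policy prior and shrinking with the visit count, to each action's mean value. The function name `puct_scores` and the default `c_puct=1.0` are assumptions for illustration; TurboZero's own hyperparameter names and defaults may differ.

```python
import math
from typing import List, Sequence


def puct_scores(
    q_values: Sequence[float],
    priors: Sequence[float],
    child_visits: Sequence[int],
    c_puct: float = 1.0,  # exploration constant; value chosen for illustration
) -> List[float]:
    """Score each action as Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    parent_visits = sum(child_visits)
    scale = c_puct * math.sqrt(parent_visits)
    return [
        q + scale * p / (1 + n)
        for q, p, n in zip(q_values, priors, child_visits)
    ]


scores = puct_scores(q_values=[0.5, 0.0], priors=[0.5, 0.5], child_visits=[3, 0])
best_action = max(range(len(scores)), key=scores.__getitem__)
print(best_action)  # 1: the unvisited action gets the larger exploration bonus
```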

Training can be done in a Jupyter notebook or via the command line. In addition to environment parameters and training hyperparameters, the user may specify the number of environments to train in parallel, so that the run can be tuned to the available hardware. See Quickstart for a quick guide on how to get started, or Training for full information on configuring your training run. I also provide example configurations that I have used to train effective models for each environment.

Evaluation

In addition to the algorithms supporting training a policy, TurboZero also provides vectorized implementations of the following algorithms that serve as baselines to evaluate against:

| Name | Description | Parameters |
| --- | --- | --- |
| Greedy MCTS | MCTS using a heuristic function to evaluate leaf nodes | parameters |
| Greedy | Evaluates potential actions using a heuristic function, no tree search | parameters |
| Random | Makes a random legal move | parameters |

Evaluating against these algorithms can be baked into the evaluation step of a training run, or be run independently. See Evaluation & Testing for the full configuration specification.
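To give a feel for what a vectorized baseline involves, the sketch below picks one uniformly random legal move for every environment in a batch with a single call. It illustrates the Random baseline's idea only and is not TurboZero's code; the function name `random_legal_actions` and the `(N, num_actions)` boolean mask layout are assumptions.

```python
import torch


def random_legal_actions(legal_mask: torch.Tensor) -> torch.Tensor:
    """Sample one legal action index per environment, uniformly at random.

    legal_mask: bool tensor of shape (N, num_actions); every row must
    contain at least one True, otherwise torch.multinomial raises an error.
    """
    # multinomial treats each row as unnormalised weights, so a 0/1 mask
    # yields a uniform draw over that row's legal actions.
    return torch.multinomial(legal_mask.float(), num_samples=1).squeeze(1)


legal_mask = torch.tensor([
    [True, False, False, True],   # actions 0 and 3 are legal
    [False, True, False, False],  # only action 1 is legal
])
actions = random_legal_actions(legal_mask)
print(actions.shape)  # torch.Size([2])
```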

Tournaments / Calculating Elo

Available for multi-player environments, tournaments provide a great way to gauge the relative strength of an algorithm in relation to various opponents. This allows the user to evaluate the effectiveness of adjusting parameters of an algorithm, or analyze how effective increasing the size of a neural network is in terms of performance. In addition, tournaments allow algorithms to be compared against a large cohort of baseline algorithms. Where applicable, I provide tournament data for each environment that will allow you to test your algorithms and models against a pre-populated field.

For more about tournaments and their configuration options, see the Tournaments wiki page.
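For reference, the textbook Elo update after a single game can be sketched as follows. This is the standard formula, not necessarily how TurboZero computes tournament ratings; the function names and the K-factor of 32 are assumptions for illustration.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    rating_a: float,
    rating_b: float,
    score_a: float,   # 1.0 = A wins, 0.5 = draw, 0.0 = A loses
    k: float = 32.0,  # K-factor; illustrative value
) -> Tuple[float, float]:
    """Return both players' ratings after one game."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


print(update_elo(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

A 400-point rating gap corresponds to an expected score of about 0.91 for the stronger player, which is why wins against much weaker baselines move a rating very little.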

Demo

Demo mode provides the option to step through a game alongside an algorithm, which can be useful as a debugging tool or simply interesting to watch. For multi-player games, demo mode allows you to play against an algorithm, whether it be a heuristic baseline or a trained policy. For more information, see the Demo page.

Quickstart

Google Colab

I've included a Hello World Google Colab notebook that runs through all of the main features of TurboZero and lets the user train and play against their own Othello AlphaZero model in only a few hours: Open In Colab

If you'd rather run TurboZero on your own machine, follow the setup instructions below.

Setup

The following commands will install poetry (dependency management), clone the repository, install required packages, and create a kernel for notebooks to connect to.

curl -sSL https://install.python-poetry.org | python3 - && export PATH="/root/.local/bin:$PATH" && git clone https://github.com/lowrollr/turbozero.git && cd turbozero && poetry install && poetry run python -m ipykernel install --user --name turbozero

This gives you access to the proper dependencies in Jupyter Notebooks by connecting to the turbozero kernel.

You can run scripts on the command-line by creating a shell using

poetry shell

If you'd rather not use poetry's shell, you can instead prepend poetry run to any command.

Training

To get started training a simple model, you can use one of the following commands, which load example configurations I've included for demonstration purposes. These commands will train a model and run periodic evaluation steps to track progress.

AlphaZero for Othello (CPU)
python turbozero.py --verbose --mode=train --config=./example_configs/othello_tiny.yaml --logfile=./othello_tiny.log
LazyZero for 2048 (CPU)
python turbozero.py --verbose --mode=train --config=./example_configs/2048_tiny.yaml --logfile=./2048_tiny.log

The configuration files I've included train very small models and do not run many environments in parallel. You should be able to run this on your personal machine, but these commands will not train performant models.

If you have access to a GPU with CUDA, you can use the following commands to train slightly larger models.

AlphaZero for Othello (GPU)
python turbozero.py --verbose --gpu --mode=train --config=./example_configs/othello_mini.yaml --logfile=./othello_mini.log
LazyZero for 2048 (GPU)
python turbozero.py --verbose --gpu --mode=train --config=./example_configs/2048_mini.yaml --logfile=./2048_mini.log

With proper hardware these should not take long to train, as they are still relatively small. These commands will train on 4096 environments in parallel as opposed to 32 for the CPU configuration.

For more information on training configuration, please see the Training wiki page.

Evaluation

If you'd like to evaluate an existing model, you can use --mode=test and point to a checkpoint file with --checkpoint. For example:

python turbozero.py --verbose --mode=test --config=./example_config/my_test_config.yaml --checkpoint=./checkpoints/my_checkpoint.pt --logfile=./test.log

For more information on evaluation/testing configuration, see the Evaluation & Testing wiki page.

Tournament

To run an example tournament with some heuristic algorithms, you can run the following command:

python turbozero.py --mode=tournament --config=./example_configs/othello_tournament.yaml

Remember to use the --gpu flag here if you have a GPU; all algorithms are hardware accelerated!

For more information on tournament configuration, see the Tournaments wiki page.

Demo

python turbozero.py --mode=demo --config=./example_configs/othello_demo.yaml

For more information on demo configuration, see the Demo wiki page!

Issues

If you use this project and encounter an issue, error, or undesired behavior, please submit a GitHub Issue and I will do my best to resolve it as soon as I can. You may also contact me directly via hello@jacob.land.

Contributing

Contributions, improvements, and fixes are more than welcome! I've written a lot in the Wiki, and I hope it provides enough information to get started. For now I don't have a formal process for this, other than creating a Pull Request.

Cite This Work

If you found this work useful, please cite it with:

@software{Marshall_TurboZero_Vectorized_AlphaZero,
  author = {Marshall, Jacob},
  title = {{TurboZero: Vectorized AlphaZero, MCTS, and Environments}},
  url = {https://github.com/lowrollr/turbozero}
}
