A fully open reproduction of DeepSeek-R1. This repo is a work in progress, let's build it together!
Table of Contents
- Overview
- Plan of attack
- Installation
- Training models
- Evaluating models
- Reproducing DeepSeek's evaluation results
- Data generation
- Contributing
The goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it. The project is simple by design and mostly consists of:
- `src/open_r1`: contains the scripts to train models as well as generate synthetic data:
  - `grpo.py`: trains a model with GRPO on a given dataset.
  - `sft.py`: performs a simple SFT of a model on a dataset.
  - `generate.py`: generates synthetic data from a model using Distilabel.
- `Makefile`: contains easy-to-run commands for each step in the R1 pipeline, leveraging the scripts above.
We will use the DeepSeek-R1 tech report as a guide, which can roughly be broken down into three main steps:
- Step 1: replicate the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1.
- Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will likely involve curating new, large-scale datasets for math, reasoning, and code.
- Step 3: show we can go from base model to RL-tuned via multi-stage training.
- 🧑🍳 [2025/05/26] (Step 1 completed!) We release Mixture-of-Thoughts, a curated reasoning dataset of 350k verified traces distilled from R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. We also provide a recipe to train OpenR1-Distill-7B, which replicates the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and marks the completion of step 1 in the Open R1 project.
- ⚡️ [2025/03/11] (update #3): We release the CodeForces-CoTs dataset of 10k competitive programming problems and 100k solutions distilled from R1. We also release IOI24: a new benchmark of very hard problems from international olympiads. A 7B Qwen model trained on CodeForces-CoTs can outperform Claude 3.7 Sonnet on IOI24, while a 32B model can outperform R1 itself.
- ∞ [2025/02/10] (update #2): We release the OpenR1-Math-220k dataset of 220k traces distilled from R1 on a new version of NuminaMath. Models trained on this dataset match the performance of DeepSeek's distilled ones.
- 🔥 [2025/02/02] (update #1): We implement the first parts of the training, inference, and evaluation pipelines. Let's go!
Caution
Libraries rely on CUDA 12.4. If you see errors related to segmentation faults, double-check the version your system is running with `nvcc --version`.
To run the code in this project, first create a Python virtual environment using e.g. `uv`. To install `uv`, follow the UV Installation Guide.
Note
As a shortcut, run `make install` to set up the development libraries (spelled out below). Afterwards, if everything is set up correctly, you can try out the Open-R1 models (for a quick sanity check, see the inference sketch at the end of this installation section).
```shell
uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
```
Tip
For Hugging Face cluster users, add `export UV_LINK_MODE=copy` to your `.bashrc` to suppress cache warnings from `uv`.
Next, install vLLM and FlashAttention:
```shell
uv pip install vllm==0.8.5.post1
uv pip install setuptools && uv pip install flash-attn --no-build-isolation
```
This will also install PyTorch v2.6.0, and it is very important to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:
```shell
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
```
Next, log into your Hugging Face and Weights and Biases accounts as follows:
```shell
huggingface-cli login
wandb login
```
Finally, check whether your system has Git LFS installed so that you can load and push models/datasets to the Hugging Face Hub:
```shell
git-lfs --version
```
If it isn't installed, run:
```shell
sudo apt-get install git-lfs
```
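Once the installation is complete, you can try out one of the Open-R1 models. Below is a minimal inference sketch, assuming the `open-r1/OpenR1-Distill-7B` checkpoint described later in this README (any R1-style chat checkpoint should work) and a GPU with enough memory; it is an illustration, not part of the repo's scripts.

```python
# Minimal sketch: generate a response from an Open-R1 checkpoint with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "open-r1/OpenR1-Distill-7B"  # assumption: swap in any Open-R1 / R1-distilled checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Solve 2x + 3 = 7 and explain your reasoning."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```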
Note
The training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.
We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to perform SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as open-r1/Mixture-of-Thoughts, run:
```shell
# Train via command line
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --eos_token '<|im_end|>' \
    --learning_rate 4.0e-5 \
    --num_train_epochs 5 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 2 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/OpenR1-Distill-7B

# Train via YAML config
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml
```
Currently, the following tasks are supported:
- Supervised Fine-Tuning (`sft`)
- Group Relative Policy Optimization (`grpo`)
Tip
If you scale up/down the number of GPUs, we recommend also adjusting the per-device batch size or number of gradient accumulation steps to keep the global batch size constant.
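To make the arithmetic explicit, here is a tiny sketch with example numbers (not taken from any recipe):

```python
# Global batch size = num_gpus x per_device_train_batch_size x gradient_accumulation_steps.
num_gpus, per_device_batch_size, grad_accum_steps = 8, 2, 8
print(num_gpus * per_device_batch_size * grad_accum_steps)  # 128

# Dropping to 4 GPUs keeps the global batch size at 128 if you double either
# the per-device batch size or the gradient accumulation steps.
print(4 * 4 * 8)  # 128
```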
By default, these scripts will push each model to your Hugging Face Hub username, i.e. `{username}/{model_name}-{task}`. You can override the parameters in each YAML config by appending them to the command as follows:
```shell
# Change the base model to a smaller variant
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \
    --model_name_or_path Qwen/Qwen3-0.6B-Base \
    --hub_model_id OpenR1-Distill-0.6B \
    --output_dir data/OpenR1-Distill-0.6B
```
If you also wish to override the Weights and Biases default settings, you can do so as follows:
```shell
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml --wandb_entity huggingface --wandb_project open-r1 --run_name Qwen2.5-1.5B-GRPO
```
🚨 WARNING 🚨
Most base models like `meta-llama/Llama-3.2-1B` do not have a chat template, so we set ChatML as the default during training. However, for Qwen base models like `Qwen/Qwen2.5-1.5B`, a chat template is pre-defined in the tokenizer, so the EOS token must be set accordingly, e.g.
```diff
# Align EOS token with chat template for Qwen base models
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path Qwen/Qwen2.5-1.5B \
+   --eos_token '<|im_end|>' \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
```
If you wish to use a custom chat template (e.g. Llama or Gemma), then the chat template and associated EOS token must be provided:
```diff
# Align EOS token with custom chat template
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path meta-llama/Llama-3.2-1B \
+   --chat_template "$(cat llama_chat_template.jinja)" \
+   --eos_token '<|eot_id|>' \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/Llama-3.2-1B-Open-R1-Distill
```
We provide a recipe to reproduce the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B, starting from the same base model. To do so, run:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml
```
The result will be a model like open-r1/OpenR1-Distill-7B, with the following downstream performance:
Model | AIME 2024 | MATH-500 | GPQA Diamond | LiveCodeBench v5 |
---|---|---|---|---|
OpenR1-Distill-7B | 52.7 | 89.0 | 52.8 | 39.4 |
DeepSeek-R1-Distill-Qwen-7B | 51.3 | 93.5 | 52.4 | 37.4 |
You can adjust the YAML config to train on a different base model or dataset.
We use TRL's vLLM backend to scale training to large models across multiple nodes. For single-node training of smol models across 8 GPUs, use `vllm_mode="colocate"` to run vLLM in the same process as the training script:
```shell
ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \
    --vllm_mode colocate
```
Warning
The chat template used in the distilled DeepSeek models omits the contents of the reasoning block within the `<think>` and `</think>` tags. It also prefills the assistant response with `<think>`, which interferes with the format reward function. To handle that, it is important to override the chat template as done in e.g. recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml.
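To see this behaviour for yourself, you can render the distilled model's chat template and inspect how the assistant turn is prefilled. A minimal sketch (the prompt text is just an example):

```python
# Sketch: render the distilled model's chat template to inspect the prefilled assistant turn.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 1 + 1?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # inspect the trailing tokens to see how the reasoning block is handled
```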
For multi-node training on N+1 nodes, with 1 node running the vLLM server and N nodes running training, we provide an example Slurm script. For instance, to run the above example on 1+1 nodes with data parallelism, run:
```shell
sbatch --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task grpo --config demo --accelerator zero2 --dp 8 --tp 1
```
See the Launching jobs on a Slurm cluster section for more details.
We provide support for filtering datasets by generating completions and computing the pass rate on verifiable tasks; see this README.
We provide a `code` reward function for executing code generated by the policy during training. Currently, this reward function targets code contests like Codeforces, where solutions are executed against a set of test cases and the overall success rate is returned as the final reward (see the sketch after the provider list). To ensure safe execution, we support multiple sandbox providers:
- E2B - Fast, cloud-based sandboxes with a focus on Python execution
- Morph - Cloud-based sandboxes with broader language support (Python/JS/C++/Rust)
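Conceptually, the reward for a single completion is the fraction of test cases it passes once executed in the sandbox. A minimal sketch of that final aggregation step (not the repo's implementation, which also handles execution, timeouts, and the provider APIs):

```python
# Sketch: aggregate per-test-case results into a pass-rate reward for one completion.
def pass_rate_reward(actual_outputs: list[str], expected_outputs: list[str]) -> float:
    passed = sum(a.strip() == e.strip() for a, e in zip(actual_outputs, expected_outputs))
    return passed / len(expected_outputs)

print(pass_rate_reward(["1\n", "3\n"], ["1", "2"]))  # 0.5
```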
To use the code reward function, first install the necessary dependencies:
```shell
uv pip install -e '.[code]'
```
To use E2B sandboxes, create a `.env` file and add your E2B API token:
```
E2B_API_KEY="e2b_xxx"
```
To use Morph, first install the morphcloud package:
```shell
pip install morphcloud
```
Then add your Morph API token to the `.env` file:
```
MORPH_API_KEY="YOUR_MORPH_API_KEY"
```
To specify which provider to use, add the `provider_type` parameter to your configuration:
```yaml
# For E2B
provider_type: e2b

# For Morph
provider_type: morph
```
Make sure your dataset contains a `verification_info` column with the following schema (adopted from PrimeIntellect's excellent datasets of verifiable problems):
{"language":"python",# Morph supports more languages including C++, Java, etc."test_cases": [ {"input":"4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n","output":"1\n3\n-1\n0\n\n2\n1 2\n","type":"stdin_stdout", } ],}
For example, to train a smol model on Python problems, start the vLLM server:
```shell
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-1.5B-Instruct
```
Then run training with:
```shell
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes=7 \
    src/open_r1/grpo.py --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code.yaml
```
It is possible to be rate limited when too many scripts are executed on sandbox services. For both providers, we offer router scripts that can be launched on a CPU node:
For E2B:
```shell
sbatch slurm/e2b_router.slurm
```
For Morph:
```shell
sbatch slurm/morph_router.slurm
```
Then add the router URL in your training YAML config:
```yaml
# For E2B
e2b_router_url: 1.2.3.4:8000

# For Morph
morph_router_url: 1.2.3.4:8000
```
The port should match the one used when launching the router. All training jobs can share the same router IP, which will ensure parallel executions are properly managed.
We provide `ioi_code_reward` and `cf_code_reward` reward functions for executing problems from IOI and CodeForces, respectively. You can use either Piston or Morph (currently IOI only) as your execution provider.
To use Piston:
- Get Piston workers running; see `slurm/piston/README.md`
- Set your environment variable `PISTON_ENDPOINTS` to `slurm` or to a list of Piston worker endpoints
For IOI:
- In your configuration, use `ioi_provider: "piston"`
For CodeForces:
- Download the generated (hard) test cases:
```shell
# Change PATH_TO_SAVE_TESTCASES. Increase --max-workers according to your machine's capacity
huggingface-cli download open-r1/codeforces --repo-type=dataset --include='generated_tests/*.parquet' --max-workers=8 --local-dir PATH_TO_SAVE_TESTCASES
```
- Save the path in `.env`:
```
CF_TESTS_FOLDER=PATH_TO_SAVE_TESTCASES
```
Morph is a cloud-based solution that provides sandboxed environments for running code. To use it:
- Install the Morph client:
```shell
pip install morphcloud
```
- Add your Morph API key to the `.env` file: `MORPH_API_KEY="your_key_here"`
- In your configuration, use `ioi_provider: "morph"`
For IOI:
See the example recipe for how to use the IOI reward function:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
    --num_processes=7 src/open_r1/grpo.py \
    --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code_ioi.yaml
```
For CodeForces:
```shell
sbatch --job-name=cf-grpo --nodes=2 slurm/train.slurm --model Qwen2.5-Coder-7B-Instruct --task grpo --config codeforces --accelerator zero3 --dp 8 --tp 1
```
If you have access to a Slurm cluster, we provide a `slurm/train.slurm` script that will automatically queue training jobs for you. Here's how you can use it:
```shell
sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model {model_name} --task {task} --config {config_suffix} --accelerator {accelerator}
```
Here `{model_name}` and `{task}` are defined as above, while `{config_suffix}` refers to the specific config and `{accelerator}` refers to the choice of 🤗 Accelerate config in `recipes/accelerate_configs`. If you wish to override the default config parameters, you can provide them by appending a space-separated string like `'--arg1=value1 --arg2=value2'`. Here's a concrete example to run SFT on 1 node of 8 GPUs:
```shell
sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model OpenR1-Distill-7B --task sft --config distill --accelerator zero3
```
You can scale the number of nodes by increasing the `--nodes` flag.
For GRPO, we use 1 node for the vLLM server and N nodes for training. For example, to run GRPO on 1+1 nodes with mixed data and tensor parallelism, run:
```shell
sbatch --job-name=open_r1 --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task grpo --config demo --accelerator zero2 --dp 4 --tp 2
```
Note
The configuration in `slurm/train.slurm` is optimised for the Hugging Face Compute Cluster and may require tweaking to be adapted to your own compute nodes.
To combine multiple datasets into a single training mixture, you can specify the `dataset_mixture` parameter in the YAML config file. Here's a template for how to do this:
```yaml
dataset_mixture:
  datasets:                   # List of datasets to include in the mixture
    - id: dataset_1           # Hub dataset ID
      config: config_name_1   # Name of the dataset config
      split: split_1          # Split to use from the dataset
      columns:                # Columns to keep
        - column_1
        - column_2
      weight: 0.25            # Fraction of dataset to use
    - id: dataset_2
      config: config_name_2
      split: split_2
      columns:
        - column_1
        - column_2
      weight: 0.5
  seed: 42                    # Seed for shuffling the combined dataset
  test_split_size: 0.1        # Fraction of mixture to use for a test split
```
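Conceptually, such a weighted mixture amounts to taking a fraction of each (shuffled) dataset, concatenating, reshuffling, and carving out a test split. A rough sketch with the `datasets` library, reusing the placeholder IDs from the template above (an illustration, not the repo's exact implementation):

```python
# Sketch: assemble a weighted dataset mixture. The dataset IDs are placeholders
# from the template above; replace them with real Hub datasets before running.
from datasets import concatenate_datasets, load_dataset

specs = [
    {"id": "dataset_1", "config": "config_name_1", "split": "split_1", "weight": 0.25},
    {"id": "dataset_2", "config": "config_name_2", "split": "split_2", "weight": 0.5},
]
seed, test_split_size = 42, 0.1

parts = []
for spec in specs:
    ds = load_dataset(spec["id"], spec["config"], split=spec["split"]).shuffle(seed=seed)
    parts.append(ds.select(range(int(spec["weight"] * len(ds)))))  # keep the requested fraction

mixture = concatenate_datasets(parts).shuffle(seed=seed)
splits = mixture.train_test_split(test_size=test_split_size, seed=seed)
print(splits)
```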
We use `lighteval` to evaluate models. For models which fit on a single GPU, run:
```shell
export VLLM_WORKER_MULTIPROC_METHOD=spawn # Required for vLLM

MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

# AIME 2024
TASK=aime24
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# MATH-500
TASK=math_500
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# GPQA Diamond
TASK=gpqa:diamond
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# LiveCodeBench
lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
To increase throughput across multiple GPUs, use data parallelism as follows:
```shell
NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
For large models which require sharding across GPUs, use tensor parallelism and run:
```shell
NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
You can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.
To evaluate on a single GPU:
```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24
```
To use Data Parallelism:
```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8
```
To use Tensor Parallelism:
```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8
```
The DeepSeek-R1 paper uses sampling with 4-64 responses per query to estimate `pass@1` accuracy, but does not specify the number of responses per benchmark. In the tables below, we estimate `pass@1` accuracy with the following number of responses per query:
Benchmark | Number of responses per query |
---|---|
AIME 2024 | 64 |
MATH-500 | 4 |
GPQA Diamond | 8 |
LiveCodeBench | 16 |
Note that for benchmarks like AIME24, it is important to sample many responses as there are only 30 problems and this can introduce high variance across repeated runs. The choice of how many responses to sample per prompt likely explains the small differences between our evaluation results and those reported by DeepSeek.
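For reference, `pass@1` with n samples per query is simply the per-problem fraction of correct samples, averaged over problems. A minimal sketch (not the lighteval implementation):

```python
# Sketch: estimate pass@1 from n sampled responses per problem.
# `results[i][j]` is True if the j-th sample for problem i is correct.
def estimate_pass_at_1(results: list[list[bool]]) -> float:
    per_problem = [sum(samples) / len(samples) for samples in results]
    return 100 * sum(per_problem) / len(per_problem)

# Example: 3 problems with 4 samples each.
print(estimate_pass_at_1([[True, True, False, True], [False] * 4, [True] * 4]))  # ~58.3
```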
We are able to reproduce DeepSeek's reported results on the AIME 2024 benchmark within ~1-3 standard deviations:
Model | AIME 2024 (🤗 LightEval) | AIME 2024 (DeepSeek Reported) |
---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 30.7 | 28.9 |
DeepSeek-R1-Distill-Qwen-7B | 50.8 | 55.5 |
DeepSeek-R1-Distill-Qwen-14B | 65.9 | 69.7 |
DeepSeek-R1-Distill-Qwen-32B | 69.7 | 72.6 |
DeepSeek-R1-Distill-Llama-8B | 43.9 | 41.7 |
DeepSeek-R1-Distill-Llama-70B | 63.0 | 70.0 |
To reproduce these results use the following command:
```shell
NUM_GPUS=1 # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
Alternatively, you can launch Slurm jobs as follows:
```shell
python scripts/run_benchmarks.py --model-id {model_id} --benchmarks aime24
```
We are able to reproduce DeepSeek's reported results on the MATH-500 benchmark within ~1-3 standard deviations:
Model | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 83.1 | 83.9 |
DeepSeek-R1-Distill-Qwen-7B | 94.5 | 92.8 |
DeepSeek-R1-Distill-Qwen-14B | 94.1 | 93.9 |
DeepSeek-R1-Distill-Qwen-32B | 95.6 | 94.3 |
DeepSeek-R1-Distill-Llama-8B | 88.6 | 89.1 |
DeepSeek-R1-Distill-Llama-70B | 95.1 | 94.5 |
To reproduce these results use the following command:
```shell
export VLLM_WORKER_MULTIPROC_METHOD=spawn
NUM_GPUS=1 # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|math_500|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
Alternatively, you can launch Slurm jobs as follows:
```shell
python scripts/run_benchmarks.py --model-id {model_id} --benchmarks math_500
```
We are able to reproduce DeepSeek's reported results on the GPQA Diamond benchmark within ~1-3 standard deviations:
Model | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 35.8 | 33.8 |
DeepSeek-R1-Distill-Qwen-7B | 50.5 | 49.1 |
DeepSeek-R1-Distill-Qwen-14B | 61.5 | 59.1 |
DeepSeek-R1-Distill-Qwen-32B | 63.1 | 62.1 |
DeepSeek-R1-Distill-Llama-8B | 46.7 | 49.0 |
DeepSeek-R1-Distill-Llama-70B | 67.4 | 65.2 |
To reproduce these results use the following command:
```shell
export VLLM_WORKER_MULTIPROC_METHOD=spawn
NUM_GPUS=1 # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|gpqa:diamond|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
Alternatively, you can launch Slurm jobs as follows:
```shell
python scripts/run_benchmarks.py --model-id {model_id} --benchmarks gpqa
```
We are able to reproduce DeepSeek's reported results on the LiveCodeBench code generation benchmark within ~1-3 standard deviations:
Model | LiveCodeBench (🤗 LightEval) | LiveCodeBench (DeepSeek Reported) |
---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 16.1 | 16.9 |
DeepSeek-R1-Distill-Qwen-7B | 37.4 | 37.6 |
DeepSeek-R1-Distill-Qwen-14B | 51.3 | 53.1 |
DeepSeek-R1-Distill-Qwen-32B | 56.0 | 57.2 |
DeepSeek-R1-Distill-Llama-8B | 37.4 | 39.6 |
DeepSeek-R1-Distill-Llama-70B | 55.9 | 57.5 |
To reproduce these results use the following command:
```shell
NUM_GPUS=1 # Set to 8 for 32B and 70B models, or data_parallel_size=8 with the smaller models for speed
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
Alternatively, you can launch Slurm jobs as follows:
```shell
python scripts/run_benchmarks.py --model-id {model_id} --benchmarks lcb
```
The following example can be run on 1xH100. First, install the following dependencies:
uv pip install"distilabel[vllm]>=1.5.2"
Now save the following snippet into a file named `pipeline.py` and run it with `python pipeline.py`. It will generate 4 outputs for each of the 10 examples (change the username for the repository to your org/user name):
```python
from datasets import load_dataset
from distilabel.models import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

prompt_template = """\
You will be given a problem. Please reason step by step, and put your final answer within \boxed{}:
{{ instruction }}"""

dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train").select(range(10))

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # Exchange with another smol distilled r1

with Pipeline(
    name="distill-qwen-7b-r1",
    description="A pipeline to generate data from a distilled r1 model",
) as pipeline:
    llm = vLLM(
        model=model_id,
        tokenizer=model_id,
        extra_kwargs={
            "tensor_parallel_size": 1,
            "max_model_len": 8192,
        },
        generation_kwargs={
            "temperature": 0.6,
            "max_new_tokens": 8192,
        },
    )
    prompt_column = "problem"
    text_generation = TextGeneration(
        llm=llm,
        template=prompt_template,
        num_generations=4,
        input_mappings={"instruction": prompt_column} if prompt_column is not None else {},
    )

if __name__ == "__main__":
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="username/numina-deepseek-r1-qwen-7b")
```
Take a look at the sample dataset at HuggingFaceH4/numina-deepseek-r1-qwen-7b.
To run the bigger DeepSeek-R1, we used 2 nodes, each with 8×H100 GPUs, using the Slurm file present in this repo at `slurm/generate.slurm`. First, install the dependencies (for now, we need to install the vLLM dev wheel that fixes the R1 CUDA graph capture):
```shell
pip install https://wheels.vllm.ai/221d388cc5a836fa189305785ed7e887cea8b510/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu121
uv pip install "distilabel[vllm,ray,openai]>=1.5.2"
```
And then run the following command:
```shell
sbatch slurm/generate.slurm \
    --hf-dataset AI-MO/NuminaMath-TIR \
    --temperature 0.6 \
    --prompt-column problem \
    --model deepseek-ai/DeepSeek-R1 \
    --hf-output-dataset username/r1-dataset
```
Note
While the job is running, you can set up an SSH tunnel through the cluster login node to access the Ray dashboard from your computer by running `ssh -L 8265:ray_ip_head_node:8265 <login_node>`, then browsing `http://localhost:8265`.
Following s1: Simple test-time scaling, the data can be decontaminated using the script at scripts/decontaminate.py, which decontaminates a dataset using 8-grams and deduplicates the data. Sample run:
```shell
python scripts/decontaminate.py \
    --dataset "open-r1/verifiable-coding-problems-python" \
    --problem_column problem \
    --cleanup
```
It will decontaminate against the benchmark datasets and remove the contaminated samples afterwards. If no `--new_dataset_name` argument is provided, the same dataset name will be reused with `_decontaminated` appended. It runs against the prompt, which for this dataset is the column `problem`, but a different one can be provided.
Arguments for the script:
```
usage: decontaminate.py [-h] --dataset DATASET [--split SPLIT] [--ngram_size NGRAM_SIZE] [--problem_column PROBLEM_COLUMN] [--cleanup] [--new_dataset_name NEW_DATASET_NAME]

options:
  -h, --help            show this help message and exit
  --dataset DATASET     Name of the dataset to check for contamination.
  --split SPLIT         Split to check for contamination, defaults to `train`.
  --ngram_size NGRAM_SIZE
                        Size of n-grams to build, defaults to 8.
  --problem_column PROBLEM_COLUMN
                        Name of the column containing the problem (prompt).
  --cleanup             Whether to remove the contaminated rows before pushing the dataset.
  --new_dataset_name NEW_DATASET_NAME
                        New name for the dataset. If not provided, will reuse the name and add a `_decontaminated` to the name.
```
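For intuition, the core of the 8-gram check can be sketched as follows (an illustration, not the script's exact implementation; the benchmark prompts are placeholders):

```python
# Sketch: flag a sample as contaminated if it shares any 8-gram with a benchmark prompt.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}

benchmark_prompts = ["..."]  # placeholder: prompts from the evaluation benchmarks
benchmark_ngrams = set().union(*(ngrams(p) for p in benchmark_prompts))

def is_contaminated(problem: str) -> bool:
    return not ngrams(problem).isdisjoint(benchmark_ngrams)

print(is_contaminated("some training problem statement"))  # False with the placeholder benchmark
```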
Contributions are welcome. Please refer to #23.
This project is built with the collective efforts of many groups and individuals in the open AI community. We are especially grateful to the vLLM and SGLang teams for creating high-performance tooling to scale the rollouts of GRPO. We also thank the teams at OpenThoughts, Prime Intellect, and General Reasoning for creating and sharing high-quality datasets for reasoning.
If you find this project useful in your own work, please consider citing it as follows:
```bibtex
@misc{openr1,
    title = {Open R1: A fully open reproduction of DeepSeek-R1},
    url = {https://github.com/huggingface/open-r1},
    author = {{Hugging Face}},
    month = {January},
    year = {2025}
}
```