Open R1
A fully open reproduction of DeepSeek-R1. This repo is a work in progress, let's build it together!

Table of Contents

  1. Overview
  2. Plan of attack
  3. Installation
  4. Training models
  5. Evaluating models
  6. Reproducing DeepSeek's evaluation results
  7. Data generation
  8. Contributing

Overview

The goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it. The project is simple by design and mostly consists of:

  • src/open_r1: contains the scripts to train models as well as generate synthetic data:
    • grpo.py: trains a model with GRPO on a given dataset.
    • sft.py: performs a simple SFT of a model on a dataset.
    • generate.py: generates synthetic data from a model using Distilabel.
  • Makefile: contains easy-to-run commands for each step in the R1 pipeline leveraging the scripts above.

Plan of attack

We will use the DeepSeek-R1 tech report as a guide, which can roughly be broken down into three main steps:

  • Step 1: replicate the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1.
  • Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will likely involve curating new, large-scale datasets for math, reasoning, and code.
  • Step 3: show we can go from base model to RL-tuned via multi-stage training.

News 🗞️

  • 🧑‍🍳 [2025/05/26] (Step 1 completed!) We release Mixture-of-Thoughts, a curated reasoning dataset of 350k verified traces distilled from R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. We also provide a recipe to train OpenR1-Distill-7B, which replicates the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and marks the completion of step 1 in the Open R1 project.
  • ⚡️ [2025/03/11] (update #3): We release the CodeForces-CoTs dataset of 10k competitive programming problems and 100k solutions distilled from R1. We also release IOI24: a new benchmark of very hard problems from international olympiads. A 7B Qwen model trained on CodeForces-CoTs can outperform Claude 3.7 Sonnet on IOI24, while a 32B model can outperform R1 itself.
  • ∞ [2025/02/10] (update #2): We release the OpenR1-Math-220k dataset of 220k traces distilled from R1 on a new version of NuminaMath. Models trained on this dataset match the performance of DeepSeek's distilled ones.
  • 🔥 [2025/02/02] (update #1): We implement the first parts of the training, inference, and evaluation pipelines. Let's go!

Installation

Caution

Libraries rely on CUDA 12.4. If you see errors related to segmentation faults, double check the version your system is running with nvcc --version.

To run the code in this project, first create a Python virtual environment, e.g. with uv. To install uv, follow the UV Installation Guide.

Note

As a shortcut, run make install to set up the development libraries (spelled out below). Afterwards, if everything is set up correctly, you can try out the Open R1 models.

uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip

Tip

For Hugging Face cluster users, add export UV_LINK_MODE=copy to your .bashrc to suppress cache warnings from uv.

Next, install vLLM and FlashAttention:

uv pip install vllm==0.8.5.post1
uv pip install setuptools && uv pip install flash-attn --no-build-isolation

This will also install PyTorch v2.6.0 and it is very important to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via pip install -e .[LIST OF MODES]. For most contributors, we recommend:

GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"

Next, log into your Hugging Face and Weights and Biases accounts as follows:

huggingface-cli login
wandb login

Finally, check whether your system has Git LFS installed so that you can load and push models/datasets to the Hugging Face Hub:

git-lfs --version

If it isn't installed, run:

sudo apt-get install git-lfs

Training models

Note

The training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.

We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to perform SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as open-r1/Mixture-of-Thoughts, run:

# Train via command line
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --eos_token '<|im_end|>' \
    --learning_rate 4.0e-5 \
    --num_train_epochs 5 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 2 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/OpenR1-Distill-7B

# Train via YAML config
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml

Currently, the following tasks are supported:

  • Supervised Fine-Tuning: sft
  • Group Relative Policy Optimization: grpo

Tip

If you scale the number of GPUs up or down, we recommend also scaling the per-device batch size or the number of gradient accumulation steps to keep the global batch size constant.
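
For example, the global batch size is the product of three knobs. A quick sanity check in Python (illustrative numbers, not a prescribed config):

num_gpus = 8
per_device_train_batch_size = 2
gradient_accumulation_steps = 8

# Effective global batch size = GPUs x per-device batch x accumulation steps.
global_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
print(global_batch_size)  # 128

# Dropping to 4 GPUs keeps the global batch size at 128
# if the accumulation steps are doubled:
assert 4 * 2 * 16 == global_batch_size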

By default, these scripts will push each model to your Hugging Face Hub username, i.e. {username}/{model_name}-{task}. You can override the parameters in each YAML config by appending them to the command as follows:

# Change the base model to a smaller variant
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \
    --model_name_or_path Qwen/Qwen3-0.6B-Base \
    --hub_model_id OpenR1-Distill-0.6B \
    --output_dir data/OpenR1-Distill-0.6B

If you also wish to override the Weights and Biases default settings, you can do so as follows:

accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \
    --wandb_entity huggingface --wandb_project open-r1 --run_name Qwen2.5-1.5B-GRPO

🚨 WARNING 🚨

Most base models like meta-llama/Llama-3.2-1B do not have a chat template, so we set ChatML as the default during training. However, for Qwen base models like Qwen/Qwen2.5-1.5B, a chat template is pre-defined in the tokenizer, so the EOS token must be set accordingly, e.g.

# Align EOS token with chat template for Qwen base models
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path Qwen/Qwen2.5-1.5B \
+   --eos_token '<|im_end|>' \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill

If you wish to use a custom chat template (e.g. Llama or Gemma), then the chat template and associated EOS token must be provided:

# Align EOS token with custom chat template
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path meta-llama/Llama-3.2-1B \
+   --chat_template "$(cat llama_chat_template.jinja)" \
+   --eos_token '<|eot_id|>' \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/Llama-3.2-1B-Open-R1-Distill

SFT distillation

We provide a recipe to reproduce the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B, starting from the same base model. To do so, run:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml

The result will be a model like open-r1/OpenR1-Distill-7B, with the following downstream performance:

| Model | AIME 2024 | MATH-500 | GPQA Diamond | LiveCodeBench v5 |
|---|---|---|---|---|
| OpenR1-Distill-7B | 52.7 | 89.0 | 52.8 | 39.4 |
| DeepSeek-R1-Distill-Qwen-7B | 51.3 | 93.5 | 52.4 | 37.4 |

You can adjust the YAML config to train on a different base model or dataset.

GRPO

We use TRL's vLLM backend to scale training to large models across multiple nodes. For single-node training of smol models across 8 GPUs, use vllm_mode="colocate" to run vLLM in the same process as the training script:

ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \
    --vllm_mode colocate

Warning

The chat template used in the distilled DeepSeek models omits the contents of the reasoning block within the <think> and </think> tags. It also prefills the assistant response with <think>, which interferes with the format reward function. To handle that, it is important to override the chat template as done in e.g. recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml.
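
To see why the prefill matters, here is a minimal sketch of a format-style reward (a simplified, hypothetical version, not the repo's exact implementation). Because the prefilled <think> tag belongs to the prompt rather than the completion, the sampled text never contains a complete <think>...</think> block:

import re

# Reward 1.0 only if the completion contains a complete <think>...</think>
# block followed by a final answer (simplified, hypothetical check).
FORMAT_RE = re.compile(r"^<think>.*?</think>.+", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0

print(format_reward("<think>step 1 ...</think>The answer is 42."))  # 1.0
# With a prefilled <think>, the completion starts after the tag and never matches:
print(format_reward("step 1 ...</think>The answer is 42."))  # 0.0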

For multi-node training on N+1 nodes, with 1 node running the vLLM server and N nodes running training, we provide an example Slurm script. For example, to run the above example on 1+1 nodes with data parallelism, run:

sbatch --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task grpo --config demo --accelerator zero2 --dp 8 --tp 1

See the Launching jobs on a Slurm cluster section for more details.

GRPO dataset filtering

We provide support for filtering datasets by generating completions and computing the pass rate on verifiable tasks; see this README. The idea is sketched below.
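
A minimal sketch, with hypothetical generate and verify callables standing in for the actual generation and verification steps:

def pass_rate(problem: dict, generate, verify, n: int = 8) -> float:
    """Fraction of n sampled completions that verify as correct."""
    completions = [generate(problem["prompt"]) for _ in range(n)]
    return sum(verify(problem, c) for c in completions) / n

def filter_by_pass_rate(problems, generate, verify, low=0.1, high=0.9):
    # Keep problems the model sometimes (but not always) solves: rows whose
    # rewards are all 0 or all 1 give GRPO no advantage signal to learn from.
    return [p for p in problems if low <= pass_rate(p, generate, verify) <= high]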

👨‍💻 Training with a code interpreter

We provide a code reward function for executing code generated by the policy during training. Currently, this reward function targets code contests like Codeforces, where solutions are executed against a set of test cases and the overall success rate is returned as the final reward (a minimal sketch follows the provider list below). To ensure safe execution, we support multiple sandbox providers:

  1. E2B - Fast, cloud-based sandboxes with focus on Python execution
  2. Morph - Cloud-based sandboxes with broader language support - Python/JS/C++/Rust
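
In shape, the reward is the fraction of test cases a candidate solution passes. A minimal sketch, where run_in_sandbox is a hypothetical stand-in for the E2B/Morph execution call:

def code_reward(code: str, test_cases: list[dict], run_in_sandbox) -> float:
    """Return the success rate of `code` over `test_cases`.

    `run_in_sandbox(code, stdin)` stands in for the E2B/Morph call and
    should return the program's stdout as a string.
    """
    passed = sum(
        run_in_sandbox(code, case["input"]).strip() == case["output"].strip()
        for case in test_cases
    )
    return passed / len(test_cases)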

To use the code reward function, first install the necessary dependencies:

uv pip install -e '.[code]'

E2B Provider

To use E2B sandboxes, create a .env file and add your E2B API token:

E2B_API_KEY="e2b_xxx"

Morph Provider

To use Morph, first install the morphcloud package:

pip install morphcloud

Then add your Morph API token to the .env file:

MORPH_API_KEY="YOUR_MORPH_API_KEY"

To specify which provider to use, add the provider_type parameter in your configuration:

# For E2B
provider_type: e2b

# For Morph
provider_type: morph

Dataset Requirements

Make sure your dataset contains a verification_info column with the following schema (adopted from PrimeIntellect's excellent datasets of verifiable problems):

{"language":"python",# Morph supports more languages including C++, Java, etc."test_cases": [        {"input":"4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n","output":"1\n3\n-1\n0\n\n2\n1 2\n","type":"stdin_stdout",        }    ],}

For example, to train a smol model on Python problems, start the vLLM server:

CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-1.5B-Instruct

Then run training with:

CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes=7 \
    src/open_r1/grpo.py --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code.yaml

Using Router Services

You may be rate limited when too many scripts are executed on the sandbox services. For both providers, we offer router scripts that can be launched on a CPU node:

For E2B:

sbatch slurm/e2b_router.slurm

For Morph:

sbatch slurm/morph_router.slurm

Then add the router URL in your training YAML config:

# For E2B
e2b_router_url: 1.2.3.4:8000

# For Morph
morph_router_url: 1.2.3.4:8000

The port should match the one used when launching the router. All training jobs can share the same router IP, which ensures parallel executions are properly managed.

Competitive Programming problems: IOI & CodeForces

We provide ioi_code_reward and cf_code_reward reward functions for executing problems from IOI and CodeForces, respectively. You can use either Piston or Morph (currently IOI only) as your execution provider.

Piston

To use Piston:

  1. Get Piston workers running; see slurm/piston/README.md
  2. Set your environment variable PISTON_ENDPOINTS to slurm or to a list of Piston worker endpoints

For IOI:

  1. In your configuration, use ioi_provider: "piston"

For CodeForces:

  1. Download the generated (hard) test cases:

# change PATH_TO_SAVE_TESTCASES. Increase --max-workers according to your machine's capacity
huggingface-cli download open-r1/codeforces --repo-type=dataset --include='generated_tests/*.parquet' --max-workers=8 --local-dir PATH_TO_SAVE_TESTCASES

  2. Save the path in .env:

CF_TESTS_FOLDER=PATH_TO_SAVE_TESTCASES
Morph

Morph is a cloud-based solution that provides sandboxed environments for running code. To use it:

  1. Install the Morph client: pip install morphcloud
  2. Add your Morph API key to the .env file: MORPH_API_KEY="your_key_here"
  3. In your configuration, use ioi_provider: "morph"

Example recipes

For IOI:

See the example recipe for how to use the IOI reward function:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
    --num_processes=7 src/open_r1/grpo.py \
    --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code_ioi.yaml

For CodeForces:

sbatch --job-name=cf-grpo --nodes=2 slurm/train.slurm --model Qwen2.5-Coder-7B-Instruct --task grpo --config codeforces --accelerator zero3 --dp 8 --tp 1

Launching jobs on a Slurm cluster

If you have access to a Slurm cluster, we provide a slurm/train.slurm script that will automatically queue training jobs for you. Here's how you can use it:

sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model {model_name} --task {task} --config {config_suffix} --accelerator {accelerator}

Here {model_name} and {task} are defined as above, while {config_suffix} refers to the specific config and {accelerator} refers to the choice of 🤗 Accelerate config in recipes/accelerate_configs. If you wish to override the default config parameters, you can provide them by appending a space-separated string like '--arg1=value1 --arg2=value2'. Here's a concrete example to run SFT on 1 node of 8 GPUs:

sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model OpenR1-Distill-7B --task sft --config distill --accelerator zero3

You can scale the number of nodes by increasing the--nodes flag.

For GRPO, we use 1 node for the vLLM server and N nodes for training. For example, to run GRPO on 1+1 nodes with mixed data and tensor parallelism, run:

sbatch --job-name=open_r1 --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task grpo --config demo --accelerator zero2 --dp 4 --tp 2

Note

The configuration inslurm/train.slurm is optimised for the Hugging Face Compute Cluster and may require tweaking to be adapted to your own compute nodes.

Customising the dataset mixture

To combine multiple datasets into a single training mixture, you can specify the dataset_mixture parameter in the YAML config file. Here's a template for how to do this:

dataset_mixture:
  datasets:                   # List of datasets to include in the mixture
    - id: dataset_1           # Hub dataset ID
      config: config_name_1   # Name of the dataset config
      split: split_1          # Split to use from the dataset
      columns:                # Columns to keep
        - column_1
        - column_2
      weight: 0.25            # Fraction of dataset to use
    - id: dataset_2
      config: config_name_2
      split: split_2
      columns:
        - column_1
        - column_2
      weight: 0.5
  seed: 42                    # Seed for shuffling the combined dataset
  test_split_size: 0.1        # Fraction of mixture to use for a test split
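
Under the hood, a weighted mixture like this can be assembled roughly as follows. This is a sketch using the 🤗 Datasets library, not the repo's exact loader:

from datasets import concatenate_datasets, load_dataset

def build_mixture(specs, seed=42, test_split_size=0.1):
    """Rough sketch: subsample each dataset by `weight`, then combine."""
    parts = []
    for spec in specs:
        ds = load_dataset(spec["id"], spec.get("config"), split=spec["split"])
        ds = ds.select_columns(spec["columns"])
        n = int(len(ds) * spec["weight"])  # fraction of the dataset to keep
        parts.append(ds.shuffle(seed=seed).select(range(n)))
    mixture = concatenate_datasets(parts).shuffle(seed=seed)
    return mixture.train_test_split(test_size=test_split_size, seed=seed)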

Evaluating models

We use lighteval to evaluate models. For models which fit on a single GPU, run:

export VLLM_WORKER_MULTIPROC_METHOD=spawn  # Required for vLLM
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

# AIME 2024
TASK=aime24
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# MATH-500
TASK=math_500
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# GPQA Diamond
TASK=gpqa:diamond
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# LiveCodeBench
lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

To increase throughput across multiple GPUs, use data parallelism as follows:

NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

For large models which require sharding across GPUs, use tensor parallelism and run:

NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

You can also launch an evaluation with make evaluate, specifying the model, task, and optionally the parallelism technique and number of GPUs.

To evaluate on a single GPU:

make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24

To use Data Parallelism:

make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8

To use Tensor Parallelism:

make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8

Reproducing DeepSeek's evaluation results

The DeepSeek-R1 paper uses sampling with 4-64 responses per query to estimate pass@1 accuracy, but does not specify the number of responses per benchmark. In the tables below, we estimate pass@1 accuracy with the following number of responses per query:

| Benchmark | Number of responses per query |
|---|---|
| AIME 2024 | 64 |
| MATH-500 | 4 |
| GPQA Diamond | 8 |
| LiveCodeBench | 16 |

Note that for benchmarks like AIME24 it is important to sample many responses, as there are only 30 problems and this can introduce high variance across repeated runs. The choice of how many responses to sample per prompt likely explains the small differences between our evaluation results and those reported by DeepSeek.
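
Concretely, pass@1 with n responses per query is just the per-problem mean success rate averaged over the benchmark. A minimal sketch:

import numpy as np

def estimate_pass_at_1(correct: np.ndarray) -> float:
    """`correct` has shape (num_problems, n_responses) with 0/1 entries."""
    return float(correct.mean(axis=1).mean())

# Toy 0/1 results: 30 AIME problems x 64 sampled responses. With only 30
# problems, small n makes the estimate noisy, hence n=64 for AIME 2024.
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(30, 64))
print(estimate_pass_at_1(scores))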

AIME 2024

We are able to reproduce DeepSeek's reported results on the AIME 2024 benchmark within ~1-3 standard deviations:

| Model | AIME 2024 (🤗 LightEval) | AIME 2024 (DeepSeek Reported) |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 30.7 | 28.9 |
| DeepSeek-R1-Distill-Qwen-7B | 50.8 | 55.5 |
| DeepSeek-R1-Distill-Qwen-14B | 65.9 | 69.7 |
| DeepSeek-R1-Distill-Qwen-32B | 69.7 | 72.6 |
| DeepSeek-R1-Distill-Llama-8B | 43.9 | 41.7 |
| DeepSeek-R1-Distill-Llama-70B | 63.0 | 70.0 |

To reproduce these results, use the following command:

NUM_GPUS=1  # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

Alternatively, you can launch Slurm jobs as follows:

python scripts/run_benchmarks.py --model-id {model_id} --benchmarks aime24

MATH-500

We are able to reproduce DeepSeek's reported results on the MATH-500 benchmark within ~1-3 standard deviations:

| Model | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 83.1 | 83.9 |
| DeepSeek-R1-Distill-Qwen-7B | 94.5 | 92.8 |
| DeepSeek-R1-Distill-Qwen-14B | 94.1 | 93.9 |
| DeepSeek-R1-Distill-Qwen-32B | 95.6 | 94.3 |
| DeepSeek-R1-Distill-Llama-8B | 88.6 | 89.1 |
| DeepSeek-R1-Distill-Llama-70B | 95.1 | 94.5 |

To reproduce these results, use the following command:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
NUM_GPUS=1  # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|math_500|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

Alternatively, you can launch Slurm jobs as follows:

python scripts/run_benchmarks.py --model-id {model_id} --benchmarks math_500

GPQA Diamond

We are able to reproduce DeepSeek's reported results on the GPQA Diamond benchmark within ~1-3 standard deviations:

| Model | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 35.8 | 33.8 |
| DeepSeek-R1-Distill-Qwen-7B | 50.5 | 49.1 |
| DeepSeek-R1-Distill-Qwen-14B | 61.5 | 59.1 |
| DeepSeek-R1-Distill-Qwen-32B | 63.1 | 62.1 |
| DeepSeek-R1-Distill-Llama-8B | 46.7 | 49.0 |
| DeepSeek-R1-Distill-Llama-70B | 67.4 | 65.2 |

To reproduce these results, use the following command:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
NUM_GPUS=1  # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|gpqa:diamond|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

Alternatively, you can launch Slurm jobs as follows:

python scripts/run_benchmarks.py --model-id {model_id} --benchmarks gpqa

LiveCodeBench

We are able to reproduce DeepSeek's reported results on the LiveCodeBench code generation benchmark within ~1-3 standard deviations:

| Model | LiveCodeBench (🤗 LightEval) | LiveCodeBench (DeepSeek Reported) |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 16.1 | 16.9 |
| DeepSeek-R1-Distill-Qwen-7B | 37.4 | 37.6 |
| DeepSeek-R1-Distill-Qwen-14B | 51.3 | 53.1 |
| DeepSeek-R1-Distill-Qwen-32B | 56.0 | 57.2 |
| DeepSeek-R1-Distill-Llama-8B | 37.4 | 39.6 |
| DeepSeek-R1-Distill-Llama-70B | 55.9 | 57.5 |

To reproduce these results, use the following command:

NUM_GPUS=1  # Set to 8 for 32B and 70B models, or data_parallel_size=8 with the smaller models for speed
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

Alternatively, you can launch Slurm jobs as follows:

python scripts/run_benchmarks.py --model-id {model_id} --benchmarks lcb

Data generation

Generate data from a smol distilled R1 model

The following example can be run on 1x H100. First, install the following dependencies:

uv pip install"distilabel[vllm]>=1.5.2"

Now save the following snippet into a file named pipeline.py and run it with python pipeline.py. It will generate 4 outputs for each of the 10 examples (change the username for the repository to your org/user name):

from datasets import load_dataset
from distilabel.models import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

prompt_template = """\
You will be given a problem. Please reason step by step, and put your final answer within \\boxed{}:
{{ instruction }}"""

dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train").select(range(10))

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # Exchange with another smol distilled r1

with Pipeline(
    name="distill-qwen-7b-r1",
    description="A pipeline to generate data from a distilled r1 model",
) as pipeline:
    llm = vLLM(
        model=model_id,
        tokenizer=model_id,
        extra_kwargs={
            "tensor_parallel_size": 1,
            "max_model_len": 8192,
        },
        generation_kwargs={
            "temperature": 0.6,
            "max_new_tokens": 8192,
        },
    )
    prompt_column = "problem"
    text_generation = TextGeneration(
        llm=llm,
        template=prompt_template,
        num_generations=4,
        input_mappings={"instruction": prompt_column} if prompt_column is not None else {},
    )

if __name__ == "__main__":
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="username/numina-deepseek-r1-qwen-7b")

Take a look at the sample dataset at HuggingFaceH4/numina-deepseek-r1-qwen-7b.

Generate data from DeepSeek-R1

To run the bigger DeepSeek-R1, we used 2 nodes, each with 8× H100 GPUs, using the Slurm file in this repo at slurm/generate.slurm. First, install the dependencies:

(for now we need to install the vLLM dev wheel that fixes the R1 CUDA graph capture)

pip install https://wheels.vllm.ai/221d388cc5a836fa189305785ed7e887cea8b510/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu121
uv pip install "distilabel[vllm,ray,openai]>=1.5.2"

And then run the following command:

sbatch slurm/generate.slurm \
    --hf-dataset AI-MO/NuminaMath-TIR \
    --temperature 0.6 \
    --prompt-column problem \
    --model deepseek-ai/DeepSeek-R1 \
    --hf-output-dataset username/r1-dataset

Note

While the job is running, you can set up an SSH tunnel through the cluster login node to access the Ray dashboard from your computer by running ssh -L 8265:ray_ip_head_node:8265 <login_node>, then browsing http://localhost:8265.

Data decontamination

Following s1: Simple test-time scaling, the data can be decontaminated using the script at scripts/decontaminate.py, which decontaminates a dataset using 8-grams and deduplicates the data. Sample run:

python scripts/decontaminate.py \
    --dataset "open-r1/verifiable-coding-problems-python" \
    --problem_column problem \
    --cleanup

It will decontaminate against the benchmark datasets and remove the contaminated samples afterwards. If no --new_dataset_name argument is provided, the same dataset name will be reused with _decontaminated appended. The script runs against the prompt, which for this dataset is the column problem, but a different column can be provided.
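
The core check is n-gram overlap: a training sample is flagged if it shares any 8-gram with a benchmark sample. A simplified sketch of the idea (the actual script adds normalization and deduplication on top):

def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams of `text`, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_ngrams(benchmark_samples, n: int = 8) -> set:
    bank = set()
    for text in benchmark_samples:
        bank |= ngrams(text, n)
    return bank

def is_contaminated(sample: str, benchmark_ngrams: set) -> bool:
    # A single shared 8-gram with any benchmark sample flags the row.
    return not ngrams(sample).isdisjoint(benchmark_ngrams)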

Arguments for the script:

usage: decontaminate.py [-h] --dataset DATASET [--split SPLIT] [--ngram_size NGRAM_SIZE] [--problem_column PROBLEM_COLUMN] [--cleanup] [--new_dataset_name NEW_DATASET_NAME]

options:
  -h, --help            show this help message and exit
  --dataset DATASET     Name of the dataset to check for contamination.
  --split SPLIT         Split to check for contamination, defaults to `train`.
  --ngram_size NGRAM_SIZE
                        Size of n-grams to build, defaults to 8.
  --problem_column PROBLEM_COLUMN
                        Name of the column containing the problem (prompt).
  --cleanup             Whether to remove the contaminated rows before pushing the dataset.
  --new_dataset_name NEW_DATASET_NAME
                        New name for the dataset. If not provided, will reuse the name and add a `_decontaminated` to the name.

Contributing

Contributions are welcome. Please refer to #23.

Acknowledgements

This project is built with the collective efforts of many groups and individuals in the open AI community. We are especially grateful to the vLLM and SGLang teams for creating high-performance tooling to scale the rollouts of GRPO. We also thank the teams at OpenThoughts, Prime Intellect, and General Reasoning for creating and sharing high-quality datasets for reasoning.

Citation

If you find this project useful in your own work, please consider citing it as follows:

@misc{openr1,
    title = {Open R1: A fully open reproduction of DeepSeek-R1},
    url = {https://github.com/huggingface/open-r1},
    author = {{Hugging Face}},
    month = {January},
    year = {2025}
}
