A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior.


sail-sg/oat-zero


Blog: https://oatllm.notion.site/oat-zero

Installation | Experiments | Acknowledgement | Citation


One of the most inspiring results from DeepSeek-R1-Zero is the occurrence of the “Aha moment” through pure reinforcement learning (RL). At the Aha moment, the model learns emergent skills such as self-reflection, which helps it conduct in-context search to solve complex reasoning problems.

Within only a few days after R1-Zero's release, several projects independently “reproduced” R1-Zero-like training on smaller scales (e.g., 1B to 7B) and all observed the Aha moment, which is typically measured by a sudden increase of model response length. We follow their settings to scrutinize the R1-Zero-like training process, and share the following findings:

  1. There may NOT be an Aha moment in R1-Zero-like training. Instead, we found that Aha-moment behaviors (such as self-reflection patterns) already appear at epoch 0, i.e., in the base models.
  2. We found Superficial Self-Reflection (SSR) in base models' responses, in which case self-reflections do not necessarily lead to correct final answers.
  3. We took a closer look at R1-Zero-like training via RL, and found that the increasing response length phenomenon is not due to the emergence of self-reflection, but a consequence of RL optimizing well-designed rule-based reward functions (a minimal sketch of such a reward is given below the figure).

[Figure: oat-zero-results] (Left) A detailed analysis of the distribution and average length of different response groups. (Right) RL curves of test reward and model response length.
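To make finding 3 concrete, below is a minimal, hypothetical sketch of the kind of rule-based reward typically used for R1-Zero-like training on the Countdown task: a small reward for following the required answer format and the full reward only when the extracted equation uses exactly the given numbers and evaluates to the target. The function name, tag format, and reward values are illustrative assumptions, not the exact reward implemented in this repository.

```python
import re

def countdown_reward(response: str, numbers: list[int], target: int) -> float:
    """Hypothetical rule-based reward for Countdown (illustrative sketch only)."""
    # Format check: the answer must be wrapped in <answer>...</answer> tags.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # no reward without the required format

    equation = match.group(1).strip()

    # Allow only digits, arithmetic operators, parentheses, and spaces.
    if not equation or not re.fullmatch(r"[\d+\-*/() .]*", equation):
        return 0.1  # format followed, but the content is not a valid equation

    # The equation must use exactly the given numbers.
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.1

    # Correctness check: the equation must evaluate to the target value.
    try:
        value = eval(equation)  # safe here: the string is restricted to arithmetic
    except Exception:
        return 0.1
    return 1.0 if abs(value - target) < 1e-6 else 0.1
```

Because such a reward only checks the final equation, one reading of finding 3 is that RL can raise it by producing longer responses (e.g., enumerating more candidate equations) without any genuinely new self-reflection skill.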

Installation

Our implementation is based on Oat. Run the following command to install oat:

pip install vllm==0.6.2 && pip install oat-llm

Alternatively, install oat in "editable" mode for local development:

```shell
git clone https://github.com/sail-sg/oat.git
cd oat
pip install vllm==0.6.2 && pip install -e .
```

Experiments

Evaluating Self-Reflection Behavior

To evaluate the self-reflection behavior of the base model, run:

```shell
# example of analyzing Qwen/Qwen2.5-Math-7B
bash analysis/script.sh Qwen/Qwen2.5-Math-7B
```

Simply run the script with the model name as the argument.
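As a rough illustration of what such an analysis can look for, the snippet below counts keyword-based self-reflection patterns in sampled responses. The keyword list and function names are hypothetical and not taken from the repository's analysis script.

```python
# Hypothetical keyword-based detector for self-reflection patterns.
REFLECTION_KEYWORDS = [
    "wait", "let me check", "let me verify", "re-check",
    "recheck", "re-evaluate", "on second thought", "i made a mistake",
]

def has_self_reflection(response: str) -> bool:
    """Return True if the response contains any self-reflection keyword."""
    text = response.lower()
    return any(keyword in text for keyword in REFLECTION_KEYWORDS)

def reflection_rate(responses: list[str]) -> float:
    """Fraction of responses showing at least one self-reflection pattern."""
    if not responses:
        return 0.0
    return sum(has_self_reflection(r) for r in responses) / len(responses)
```

Comparing this rate on base-model samples (epoch 0) against RL checkpoints is the kind of measurement behind finding 1: if the rate is already non-trivial at epoch 0, self-reflection is not emerging during RL.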

Reproducing Oat-Zero

To reproduce the results of Oat-Zero on the Countdown task (Sec. 3.1 in our blog), run:

bash training/run_grpo.sh
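For reference, the core of GRPO-style training is a group-relative advantage: each prompt is sampled several times, and each response's reward is normalized by the group's mean and standard deviation. The snippet below is a minimal sketch under that assumption, not oat's exact implementation.

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages for one group of responses to the same prompt (sketch)."""
    r = np.asarray(rewards, dtype=np.float64)
    # Normalize each response's reward by the group's statistics.
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled responses to one Countdown prompt, two of them correct.
print(group_relative_advantages([1.0, 0.1, 1.0, 0.1]))
```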

To reproduce the results in Sec. 3.2, please follow the instructions in simpleRL/train.

Acknowledgement

  • Oat: Used for fast reproduction of the R1-Zero-like training results on the Countdown task.
  • SimpleRL: Used for reproducing the SimpleRL-Zero results on MATH.
  • TinyZero: We referred to its implementation of the reward model and the Countdown task.
  • vLLM: Used for accelerated inference.
  • We thank the developers of DeepSeek, Qwen, Rho and Llama for bringing innovative techniques and performant models to the community.

Citation

@misc{liu2025there,
  title={There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study},
  author={Liu, Zichen and Chen, Changyu and Li, Wenjun and Pang, Tianyu and Du, Chao and Lin, Min},
  year={2025},
  howpublished={\url{https://oatllm.notion.site/oat-zero}},
  note={Notion Blog},
}

License

oat-zero is distributed under the terms of the MIT license.
