A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior.


sail-sg/oat-zero


Blog: https://oatllm.notion.site/oat-zero

Installation | Experiments | Acknowledgement | Citation


One of the most inspiring results from DeepSeek-R1-Zero is the occurrence of the “Aha moment” through pure reinforcement learning (RL). At the Aha moment, the model learns emergent skills such as self-reflection, which helps it conduct in-context search to solve complex reasoning problems.

Within only a few days after R1-Zero's release, several projects independently “reproduced” R1-Zero-like training on smaller scales (e.g., 1B to 7B) and all observed the Aha moment, which is typically measured by a sudden increase of model response length. We follow their settings to scrutinize the R1-Zero-like training process, and share the following findings:

  1. There may NOT be an Aha moment in R1-Zero-like training. Instead, we found that Aha-moment behaviors (such as self-reflection patterns) already appear at epoch 0, i.e., in the base models.
  2. We found Superficial Self-Reflection (SSR) in base models' responses, in which case self-reflections do not necessarily lead to correct final answers.
  3. We took a closer look at R1-Zero-like training via RL, and found that the increasing response length phenomenon is not due to the emergence of self-reflection, but a consequence of RL optimizing well-designed rule-based reward functions (a minimal sketch of such a reward is given below the figure).

[Figure: oat-zero-results] (Left) A detailed analysis of the distribution and average length of different response groups. (Right) RL curves of test reward and model response length.
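To make finding 3 concrete, below is a minimal, hypothetical sketch of the kind of rule-based reward typically used for R1-Zero-like training on the Countdown task: a small reward for following the required answer format and the full reward only when the extracted equation uses exactly the given numbers and evaluates to the target. The function name, tag format, and reward values are illustrative assumptions, not the exact reward implemented in this repository.

```python
import re

def countdown_reward(response: str, numbers: list[int], target: int) -> float:
    """Hypothetical rule-based reward for Countdown (illustrative sketch only)."""
    # Format check: the answer must be wrapped in <answer>...</answer> tags.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # no reward without the required format

    equation = match.group(1).strip()

    # Allow only digits, arithmetic operators, parentheses, and spaces.
    if not equation or not re.fullmatch(r"[\d+\-*/() .]*", equation):
        return 0.1  # format followed, but the content is not a valid equation

    # The equation must use exactly the given numbers.
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.1

    # Correctness check: the equation must evaluate to the target value.
    try:
        value = eval(equation)  # safe here: the string is restricted to arithmetic
    except Exception:
        return 0.1
    return 1.0 if abs(value - target) < 1e-6 else 0.1
```

Because such a reward only checks the final equation, one reading of finding 3 is that RL can raise it by producing longer responses (e.g., enumerating more candidate equations) without any genuinely new self-reflection skill.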

Installation

Our implementation is based on Oat. Run the following command to install oat:

pip install vllm==0.6.2 && pip install oat-llm

Alternatively, install oat in "editable" mode for local development:

```shell
git clone https://github.com/sail-sg/oat.git
cd oat
pip install vllm==0.6.2 && pip install -e .
```

Experiments

Evaluating Self-Reflection Behavior

To evaluate the self-reflection behavior of the base model, run:

```shell
# example of analyzing Qwen/Qwen2.5-Math-7B
bash analysis/script.sh Qwen/Qwen2.5-Math-7B
```

Simply run the script with the model name as the argument.
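As a rough illustration of what such an analysis can look for, the snippet below counts keyword-based self-reflection patterns in sampled responses. The keyword list and function names are hypothetical and not taken from the repository's analysis script.

```python
# Hypothetical keyword-based detector for self-reflection patterns.
REFLECTION_KEYWORDS = [
    "wait", "let me check", "let me verify", "re-check",
    "recheck", "re-evaluate", "on second thought", "i made a mistake",
]

def has_self_reflection(response: str) -> bool:
    """Return True if the response contains any self-reflection keyword."""
    text = response.lower()
    return any(keyword in text for keyword in REFLECTION_KEYWORDS)

def reflection_rate(responses: list[str]) -> float:
    """Fraction of responses showing at least one self-reflection pattern."""
    if not responses:
        return 0.0
    return sum(has_self_reflection(r) for r in responses) / len(responses)
```

Comparing this rate on base-model samples (epoch 0) against RL checkpoints is the kind of measurement behind finding 1: if the rate is already non-trivial at epoch 0, self-reflection is not emerging during RL.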

Reproducing Oat-Zero

To reproduce the results of Oat-Zero on the Countdown task (Sec. 3.1 in our blog), run:

bash training/run_grpo.sh
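For reference, the core of GRPO-style training is a group-relative advantage: each prompt is sampled several times, and each response's reward is normalized by the group's mean and standard deviation. The snippet below is a minimal sketch under that assumption, not oat's exact implementation.

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages for one group of responses to the same prompt (sketch)."""
    r = np.asarray(rewards, dtype=np.float64)
    # Normalize each response's reward by the group's statistics.
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled responses to one Countdown prompt, two of them correct.
print(group_relative_advantages([1.0, 0.1, 1.0, 0.1]))
```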

To reproduce the results in Sec. 3.2, please follow the instructions in simpleRL/train.

Acknowledgement

  • Oat: Used for fast reproduction of the R1-Zero-like training results on the Countdown task.
  • SimpleRL: Used for reproducing the SimpleRL-Zero results on MATH.
  • TinyZero: We referred to its implementation of the reward model and the Countdown task.
  • vLLM: Used for accelerated inference.
  • We thank the developers of DeepSeek, Qwen, Rho and Llama for bringing innovative techniques and performant models to the community.

Citation

@misc{liu2025there,
  title={There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study},
  author={Liu, Zichen and Chen, Changyu and Li, Wenjun and Pang, Tianyu and Du, Chao and Lin, Min},
  year={2025},
  howpublished={\url{https://oatllm.notion.site/oat-zero}},
  note={Notion Blog},
}

License

oat-zero is distributed under the terms of the MIT license.
