There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study

A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior.

Blog: https://oatllm.notion.site/oat-zero

Installation | Experiments | Acknowledgement | Citation


One of the most inspiring results from DeepSeek-R1-Zero is the occurrence of the “Aha moment” through pure reinforcement learning (RL). At the Aha moment, the model learns emergent skills such as self-reflection, which help it conduct in-context search to solve complex reasoning problems.

Within only a few days of R1-Zero's release, several projects independently “reproduced” R1-Zero-like training at smaller scales (e.g., 1B to 7B) and all observed the Aha moment, typically measured by a sudden increase in model response length. We follow their settings to scrutinize the R1-Zero-like training process, and share the following findings:

  1. There may NOT be an Aha moment in R1-Zero-like training. Instead, we found that the Aha moment (e.g., self-reflection patterns) already appears at epoch 0, i.e., in the base models.
  2. We found Superficial Self-Reflection (SSR) in base models' responses, where self-reflection does not necessarily lead to a correct final answer.
  3. We took a closer look at R1-Zero-like training via RL, and found that the increasing response length is not due to the emergence of self-reflection, but is a consequence of RL optimizing a well-designed rule-based reward function (a minimal sketch of such a reward is given below).
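
To make finding 3 concrete, here is a minimal Python sketch of an R1-Zero-style rule-based reward that scores format compliance and final-answer correctness. The tag names, scoring values, and function signature are our illustrative assumptions, not the exact reward used in this repo or in DeepSeek-R1-Zero:

import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy R1-Zero-style rule-based reward (illustrative values).

    Gives a small bonus for following the requested <think>/<answer>
    template and a large bonus for a correct final answer.
    """
    reward = 0.0
    # Format reward: response follows the <think>...</think><answer>...</answer> template.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL):
        reward += 0.1
    # Accuracy reward: the extracted answer matches the ground truth.
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if answer and answer.group(1).strip() == ground_truth.strip():
        reward += 1.0
    return reward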

[Figure: oat-zero results] (Left) A detailed analysis of the distribution and average length of different response groups. (Right) RL curves of test reward and model response length.

Installation

Our implementation is based on Oat. Run the following command to install oat:

pip install vllm==0.6.2 && pip install oat-llm

Alternatively, install oat in "editable" mode for local development:

git clone https://github.com/sail-sg/oat.git
cd oat
pip install vllm==0.6.2 && pip install -e .

Experiments

Evaluating Self-Reflection Behavior

To evaluate the self-reflection behavior of the base model, run:

# example of analyzing Qwen/Qwen2.5-Math-7B
bash analysis/script.sh Qwen/Qwen2.5-Math-7B

To analyze a different base model, simply pass its name as the argument.
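
For intuition about what such an analysis can look like, here is a minimal keyword-matching sketch; the keyword list and function name are illustrative assumptions, not the repo's actual implementation:

# Hypothetical sketch: the keyword list below is an assumption,
# not the repo's actual heuristic.
REFLECTION_KEYWORDS = [
    "wait",
    "recheck",
    "re-check",
    "let me verify",
    "double-check",
    "alternatively",
]

def has_self_reflection(response: str) -> bool:
    # A response counts as self-reflective if any keyword occurs in it.
    text = response.lower()
    return any(keyword in text for keyword in REFLECTION_KEYWORDS)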

Reproducing Oat-Zero

To reproduce the results of Oat-Zero on Countdown task (Sec. 3.1 in our blog), run:

bash training/run_grpo.sh
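
The script trains with GRPO. As a reference for the core idea, below is a minimal sketch of GRPO's group-relative advantage computation, which normalizes each sampled response's reward within its group; the variable names and epsilon value are our assumptions:

from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each sampled response's reward by
    the mean and standard deviation of its group (responses to one prompt)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four responses to the same prompt, only the first one rewarded.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))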

To reproduce the results in Sec. 3.2, please follow the instructions in simpleRL/train.

Acknowledgement

  • Oat: Used to quickly reproduce the R1-Zero-like training results on the Countdown task.
  • SimpleRL: Used to reproduce the SimpleRL-Zero results on MATH.
  • TinyZero: We referred to its implementation of the reward model and the Countdown task.
  • vLLM: Used for accelerated inference.
  • We thank the developers of DeepSeek, Qwen, Rho and Llama for bringing innovative techniques and performant models to our community.

Citation

@misc{liu2025oatzero,
  title={There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study},
  author={Zichen Liu and Changyu Chen and Wenjun Li and Tianyu Pang and Chao Du and Min Lin},
  year={2025},
  howpublished={\url{https://oatllm.notion.site/oat-zero}},
  note={Notion Blog},
}

License

oat-zero is distributed under the terms of the MIT license.
