Understanding R1-Zero-Like Training: A Critical Perspective
March 26, 2025
Authors: Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
cs.AI
Abstract
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can
directly enhance the reasoning capabilities of LLMs without supervised
fine-tuning. In this work, we critically examine R1-Zero-like training by
analyzing its two core components: base models and RL. We investigate a wide
range of base models, including DeepSeek-V3-Base, to understand how pretraining
characteristics influence RL performance. Our analysis reveals that
DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models
demonstrate strong reasoning capabilities even without prompt templates,
suggesting potential pretraining biases. Additionally, we identify an
optimization bias in Group Relative Policy Optimization (GRPO), which
artificially increases response length (especially for incorrect outputs)
during training. To address this, we introduce Dr. GRPO, an unbiased
optimization method that improves token efficiency while maintaining reasoning
performance. Leveraging these insights, we present a minimalist R1-Zero recipe
that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a
new state-of-the-art. Our code is available at
https://github.com/sail-sg/understand-r1-zero.
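The abstract only names the GRPO optimization bias and its fix. The toy PyTorch sketch below illustrates one plausible reading, under the assumption that GRPO scales group-relative advantages by the in-group reward standard deviation and averages token losses by each response's length, while the unbiased variant drops both normalizations. The function name grpo_style_loss, the tensor shapes, and the toy inputs are illustrative assumptions, not the authors' implementation.

import torch

def grpo_style_loss(token_logps, rewards, lengths, unbiased=False, eps=1e-6):
    # token_logps: (G, T) per-token log-probs for a group of G sampled responses,
    #              zeroed out beyond each response's valid length.
    # rewards:     (G,) scalar reward per response.
    # lengths:     (G,) number of valid tokens per response.
    mean_r = rewards.mean()
    if unbiased:
        # One reading of the Dr. GRPO aggregation: plain centered advantage and a
        # simple sum over token log-probs, with no per-response length normalization.
        adv = rewards - mean_r
        per_resp = token_logps.sum(dim=1)
    else:
        # GRPO-style aggregation: std-normalized advantage and per-response 1/|o_i|
        # averaging, which the paper links to artificially growing response length,
        # especially on incorrect outputs.
        adv = (rewards - mean_r) / (rewards.std() + eps)
        per_resp = token_logps.sum(dim=1) / lengths
    return -(adv * per_resp).mean()

if __name__ == "__main__":
    G, T = 4, 8
    lengths = torch.tensor([8.0, 3.0, 8.0, 5.0])
    mask = torch.arange(T).unsqueeze(0) < lengths.unsqueeze(1)
    token_logps = -torch.rand(G, T) * mask          # hypothetical rollout log-probs
    rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])    # hypothetical correctness rewards
    print("biased  :", grpo_style_loss(token_logps, rewards, lengths).item())
    print("unbiased:", grpo_style_loss(token_logps, rewards, lengths, unbiased=True).item())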