Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

March 31, 2025
Authors: Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
cs.AI

Abstract

We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training focused on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE (lambda=1, gamma=1) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, similar to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency, requiring only a tenth of the training steps compared to the DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
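To make the recipe named in the abstract concrete, the sketch below illustrates its two ingredients: GAE with lambda=1 and gamma=1 (which collapses to the Monte Carlo return minus a value baseline) and a binary rule-based reward on the final answer. This is a minimal sketch under those assumptions; the function names (`compute_gae`, `rule_based_reward`) and the exact-match reward rule are illustrative, not the authors' released implementation.

```python
# Minimal sketch of the advantage estimator and reward described in the abstract:
# vanilla GAE with lambda = 1 and gamma = 1, plus a simple rule-based reward.
# Names here (compute_gae, rule_based_reward) are hypothetical illustrations.

from typing import List

def compute_gae(rewards: List[float], values: List[float],
                gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation over one finished response.

    `values` holds V(s_0..s_T) with V(s_T) = 0 at the terminal state.
    With gamma = lam = 1 this reduces to the Monte Carlo return minus
    the value baseline: A_t = sum_{k >= t} r_k - V(s_t).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def rule_based_reward(model_answer: str, reference_answer: str) -> float:
    """Binary rule-based reward: 1.0 for an exact match on the final
    answer, 0.0 otherwise (a hypothetical matching rule for illustration)."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

# Sparse terminal reward: only the last step of the response is rewarded.
rewards = [0.0, 0.0, rule_based_reward("42", "42")]
values = [0.3, 0.5, 0.8, 0.0]        # V(s_0..s_3), terminal value = 0
print(compute_gae(rewards, values))  # [0.7, 0.5, 0.2]
```

With gamma = lambda = 1, no within-response discounting is applied, so every step of a rollout shares the full outcome of its final answer, which matches the sparse, answer-level rule-based reward described in the abstract.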
