

An Empirical Study on Eliciting and Improving R1-like Reasoning Models

March 6, 2025
Authors: Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
cs.AI

Abstract

In this report, we present the third technical report on the development of slow-thinking models as part of the STILL project. As the technical pathway becomes clearer, scaling RL training has become a central technique for implementing such reasoning models. We systematically experiment with and document the effects of various factors influencing RL training, conducting experiments on both base models and fine-tuned models. Specifically, we demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models, enhancing both response length and test accuracy. Furthermore, we show that even when a model like DeepSeek-R1-Distill-Qwen-1.5B has already achieved a high performance level, it can be further refined through RL training, reaching an accuracy of 39.33% on AIME 2024. Beyond RL training, we also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models. This approach achieves a remarkable accuracy of 86.67% with greedy search on AIME 2024, underscoring its effectiveness in enhancing model capabilities. We release our resources at the STILL project website: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
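The abstract does not specify how tool manipulation is wired into decoding. Below is a minimal sketch, under the assumption that the model emits Python code blocks during its reasoning, an external interpreter executes each block, and the output is appended to the context before greedy generation resumes; the names `generate`, `run_tool`, and `reason_with_tools` are hypothetical and not taken from the paper.

```python
# Minimal sketch (an assumption, not the paper's implementation) of tool-augmented
# reasoning: the model writes Python inside ```python ...``` blocks, the harness
# executes each block, appends the result, and continues greedy decoding until
# a response with no tool call is produced.
import re
import subprocess

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_tool(code: str) -> str:
    """Execute a model-written snippet in a subprocess and capture its output."""
    proc = subprocess.run(
        ["python", "-c", code], capture_output=True, text=True, timeout=10
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr

def reason_with_tools(generate, prompt: str, max_rounds: int = 8) -> str:
    """`generate` is any greedy-decoding LLM call (hypothetical signature:
    str -> str) that stops after emitting a code block or a final answer."""
    context = prompt
    for _ in range(max_rounds):
        step = generate(context)           # greedy search: temperature 0
        context += step
        match = CODE_BLOCK.search(step)
        if match is None:                  # no tool call -> treat as final answer
            return context
        context += f"\n[tool output]\n{run_tool(match.group(1))}\n"
    return context
```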
