Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
March 13, 2025
作者: Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
cs.AI
Abstract
This paper presents our work on the Light-R1 series, with models, data, and
code all released.
We first focus on training long COT models from scratch, specifically
starting from models initially lacking long COT capabilities. Using a
curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO,
we train our model Light-R1-32B from Qwen2.5-32B-Instruct, resulting in
superior math performance compared to DeepSeek-R1-Distill-Qwen-32B. Despite
being trained exclusively on math data, Light-R1-32B shows strong
generalization across other domains. In the subsequent phase of this work, we
highlight the significant benefit of the 3k dataset constructed for the second
SFT stage in enhancing other models. By fine-tuning DeepSeek-R1-Distilled
models with this dataset, we obtain new SOTA models at the 7B and 14B scales, while the
32B model, Light-R1-32B-DS, performs comparably to QwQ-32B and DeepSeek-R1.
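To make the curriculum recipe concrete, the sketch below outlines one way the staged training flow could be wired together. It is a minimal illustration only, not the released Light-R1 training code: the dataset paths, the run_sft/run_dpo helpers, and all hyperparameters are hypothetical placeholders standing in for a stage-1 SFT run, a harder stage-2 SFT run on the 3k examples, and a semi-on-policy DPO pass.

```python
# Minimal sketch of the two-stage SFT + DPO curriculum described above.
# Helpers, paths, and hyperparameters are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    dataset_path: str  # hypothetical path; the real stage-1/stage-2 data differ
    epochs: int
    lr: float


def run_sft(model_path: str, stage: Stage) -> str:
    """Placeholder for a standard SFT run; returns the path of the new checkpoint."""
    print(f"SFT[{stage.name}] on {stage.dataset_path} from {model_path} "
          f"(epochs={stage.epochs}, lr={stage.lr})")
    return f"{model_path}+{stage.name}"


def run_dpo(model_path: str, pref_dataset_path: str) -> str:
    """Placeholder for semi-on-policy DPO: preference pairs are partly sampled
    from the current policy (chosen vs. rejected long-COT responses)."""
    print(f"DPO on {pref_dataset_path} from {model_path}")
    return f"{model_path}+dpo"


if __name__ == "__main__":
    # Curriculum: broad stage-1 SFT, harder 3k-example stage-2 SFT, then DPO.
    ckpt = "Qwen2.5-32B-Instruct"
    ckpt = run_sft(ckpt, Stage("sft-stage1", "data/stage1.jsonl", epochs=3, lr=1e-5))
    ckpt = run_sft(ckpt, Stage("sft-stage2", "data/stage2_3k.jsonl", epochs=3, lr=5e-6))
    ckpt = run_dpo(ckpt, "data/dpo_pairs.jsonl")
    print("final checkpoint:", ckpt)
```

The same stage-2 step is what transfers to the DeepSeek-R1-Distilled models: under this sketch, only the starting checkpoint changes while the 3k-example SFT run stays the same.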
Furthermore, we extend our work by applying reinforcement learning,
specifically GRPO, to long-COT models to further improve reasoning performance.
We successfully train our final Light-R1-14B-DS with RL, achieving SOTA
performance among 14B-parameter models in math. With AIME24 and AIME25 scores of 74.0
and 60.2 respectively, Light-R1-14B-DS surpasses even many 32B models and
DeepSeek-R1-Distill-Llama-70B. Its RL training also exhibits the expected
behavior of a simultaneous increase in response length and reward score.
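As a rough illustration of GRPO's core step (not the exact Light-R1 RL code), the sketch below computes group-relative advantages: for each prompt, a group of responses is sampled and each response's reward is normalized by the group's mean and standard deviation, which is what removes the need for a separate critic model. The group size and reward values are made-up examples.

```python
# Minimal sketch of GRPO's group-relative advantage computation
# (illustrative only; rewards and group size are made-up examples).
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled response's reward by the group mean and std.

    GRPO plugs these per-response advantages into a PPO-style clipped
    objective, avoiding a separate value/critic model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, a group of 8 sampled long-COT responses scored by a verifier
    # (1.0 = correct final answer, 0.0 = incorrect).
    rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(rewards))
```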
The Light-R1 series of work validates training long-COT models from scratch,
showcases the art in SFT data curation, and releases SOTA models obtained through RL.