Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
March 13, 2025
Authors: Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
cs.AI
Abstract
This paper presents our work on the Light-R1 series, with models, data, and code all released. We first focus on training long-COT models from scratch, specifically starting from models that initially lack long-COT capabilities. Using a curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO, we train our model Light-R1-32B from Qwen2.5-32B-Instruct, achieving superior math performance compared to DeepSeek-R1-Distill-Qwen-32B. Despite being trained exclusively on math data, Light-R1-32B shows strong generalization across other domains. In the subsequent phase of this work, we highlight the significant benefit of the 3k dataset constructed for the second SFT stage in enhancing other models. By fine-tuning DeepSeek-R1-Distilled models with this dataset, we obtain new SOTA models at 7B and 14B, while the 32B model, Light-R1-32B-DS, performs comparably to QwQ-32B and DeepSeek-R1.

Furthermore, we extend our work by applying reinforcement learning, specifically GRPO, to long-COT models to further improve reasoning performance. We successfully train our final model, Light-R1-14B-DS, with RL, achieving SOTA performance among 14B-parameter models in math. With AIME24 and AIME25 scores of 74.0 and 60.2 respectively, Light-R1-14B-DS surpasses even many 32B models and DeepSeek-R1-Distill-Llama-70B. Its RL training also exhibits the expected behavior of simultaneous increases in response length and reward score.

The Light-R1 series of work validates training long-COT models from scratch, showcases the art of SFT data curation, and releases SOTA models obtained through RL.
AI-Generated Summary
Paper Overview
Core Contributions
- Presents a complete, open-sourced recipe for training long chain-of-thought (COT) models from scratch, combining curriculum learning, SFT, and DPO.
- Constructs a high-quality 3k dataset that significantly improves the performance of other models.
- Successfully applies reinforcement learning (RL) on a 14B model for the first time, further improving mathematical reasoning.
Research Background
- Long chain-of-thought reasoning is increasingly popular in foundation AI models and industrial applications, but training and deploying large-scale models is costly.
- The research aims to develop compact models below 10B parameters that excel at mathematical problem solving, algorithmic planning, and scientific analysis.
Keywords
- Long chain-of-thought (COT)
- Curriculum learning
- Supervised fine-tuning (SFT)
- Direct preference optimization (DPO)
- Reinforcement learning (RL)
Background
Research Gap
- Existing methods struggle to train and deploy long-COT models in resource-constrained environments.
- Efficient training datasets and optimization methods are lacking.
Technical Challenges
- Dataset construction and optimization.
- Curriculum design and multi-stage optimization during model training.
- Applying reinforcement learning to small models.
Prior Approaches
- Conventional single-stage SFT shows limited performance on long-COT problems.
- Reinforcement learning has mainly been applied to large-scale models; its application to small models remains immature.
Methodology
Technical Architecture
- A curriculum-learning approach consisting of two-stage SFT and semi-on-policy DPO (a minimal pipeline sketch follows this list).
- The GRPO algorithm is used for reinforcement-learning optimization.
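To make the curriculum concrete, below is a minimal sketch of how the stages could be chained, assuming a generic fine-tuning interface. The stage order follows the paper, but the dataset identifiers and the run_stage helper are hypothetical placeholders rather than the authors' actual training code.

```python
# Minimal sketch of the Light-R1 curriculum: two SFT stages of increasing
# difficulty, then semi-on-policy DPO, then GRPO. Dataset names and the
# run_stage body are placeholders, not the authors' training code.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str      # stage label
    method: str    # "sft", "dpo", or "grpo"
    dataset: str   # placeholder dataset identifier

CURRICULUM = [
    Stage("sft-stage1", "sft",  "math_sft_stage1"),     # broad math SFT data
    Stage("sft-stage2", "sft",  "math_sft_stage2_3k"),  # the harder ~3k subset
    Stage("dpo",        "dpo",  "math_dpo_pairs"),      # preference pairs from the stage-2 model
    Stage("grpo",       "grpo", "math_rl_prompts"),     # verifiable-reward RL prompts
]

def run_stage(checkpoint: str, stage: Stage) -> str:
    """Placeholder: launch the appropriate trainer and return the new checkpoint path."""
    print(f"[{stage.method}] fine-tuning {checkpoint} on {stage.dataset}")
    return f"{checkpoint}+{stage.name}"

def train(base_model: str) -> str:
    """Feed each stage's output checkpoint into the next stage."""
    ckpt = base_model
    for stage in CURRICULUM:
        ckpt = run_stage(ckpt, stage)
    return ckpt

if __name__ == "__main__":
    print("final checkpoint:", train("Qwen2.5-32B-Instruct"))
```

The key design choice is that each stage starts from the previous stage's checkpoint, so later, harder data refines rather than replaces what earlier stages taught.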
Implementation Details
- The dataset underwent strict deduplication and formatting.
- Difficulty filtering used the DeepScaleR-1.5B-Preview and DeepSeek-R1-Distill-Qwen-32B models (see the filtering sketch after this list).
- Reinforcement learning followed a two-stage process of offline data selection and online optimization.
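A rough illustration of such difficulty filtering is sketched below: a reference model samples several solutions per problem, and only problems it solves infrequently are kept as "hard". The sample_responses and is_correct callables, the sample count, and the pass-rate threshold are assumptions for illustration, not the paper's exact settings.

```python
# Illustrative difficulty filter: keep problems a reference model rarely solves.
# sample_responses(model, question, n) and is_correct(answer, reference) are
# caller-supplied helpers; the defaults below are made-up, not the paper's values.
from typing import Callable, List

def pass_rate(problem: dict, model: str, n_samples: int,
              sample_responses: Callable, is_correct: Callable) -> float:
    """Fraction of sampled responses that reach the reference answer."""
    responses = sample_responses(model, problem["question"], n_samples)
    return sum(is_correct(r, problem["answer"]) for r in responses) / n_samples

def filter_hard(problems: List[dict], model: str,
                sample_responses: Callable, is_correct: Callable,
                n_samples: int = 8, max_pass_rate: float = 0.5) -> List[dict]:
    """Keep only problems the reference model solves at most max_pass_rate of the time."""
    return [
        p for p in problems
        if pass_rate(p, model, n_samples, sample_responses, is_correct) <= max_pass_rate
    ]
```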
Innovations
- A curriculum strategy that progressively increases the difficulty of the training data.
- A high-quality 3k dataset that significantly improves model performance.
- The first successful application of reinforcement learning on a 14B model, demonstrating RL's potential for small models.
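The RL stage relies on GRPO's group-relative advantage: several responses are sampled per prompt, and each response's reward is normalized against the group's mean and standard deviation, removing the need for a learned value model. A simplified illustration of this advantage computation (not the authors' implementation) follows.

```python
# Group-relative advantage at the heart of GRPO: normalize each sampled
# response's reward within its prompt group. Simplified illustration only.
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each response = (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled solutions to one math prompt, scored 1.0 if the final
# answer matches the reference and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```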
Results
Experimental Setup
- Evaluation on the AIME24, AIME25, and GPQA Diamond benchmarks.
- Training was performed on 12×H800 GPUs at a cost of roughly $1,000.
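AIME-style scores such as those above are typically reported as average accuracy over repeated sampling per problem; the sketch below shows one way this could be computed. The per-problem sample count and the sample_answer callable are assumptions for illustration, not necessarily the paper's evaluation setting.

```python
# Average accuracy over repeated generations per problem, reported in percent.
from typing import Callable, List

def aime_score(problems: List[dict], sample_answer: Callable, n_samples: int = 16) -> float:
    """Mean accuracy over n_samples independent generations per problem."""
    correct, total = 0, 0
    for p in problems:
        for _ in range(n_samples):
            correct += int(sample_answer(p["question"]) == p["answer"])
            total += 1
    return 100.0 * correct / total
```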
Key Findings
- Light-R1-32B outperforms DeepSeek-R1-Distill-Qwen-32B on AIME24 and AIME25.
- Light-R1-7B-DS and Light-R1-14B-DS reach SOTA among models of comparable size.
- Reinforcement learning significantly improves Light-R1-14B-DS's mathematical reasoning.
Limitations
- Performance in scientific and coding domains still has room for improvement.
- RL training remains relatively time-consuming.
Conclusion
- The Light-R1 series validates the feasibility of training long-COT models from scratch.
- The high-quality 3k dataset and the successful application of RL open new possibilities for advanced reasoning capabilities in resource-constrained environments.
- Future work will explore improving generalization and making RL training more efficient.