Kimi k1.5: Scaling Reinforcement Learning with LLMs
January 22, 2025
作者: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, Zonghan Yang
cs.AI
Abstract
Language model pretraining with next token prediction has proved effective
for scaling compute but is limited by the amount of available training data.
Scaling reinforcement learning (RL) unlocks a new axis for the continued
improvement of artificial intelligence, with the promise that large language
models (LLMs) can scale their training data by learning to explore with
rewards. However, prior published work has not produced competitive results. In
light of this, we report on the training practice of Kimi k1.5, our latest
multi-modal LLM trained with RL, including its RL training techniques,
multi-modal data recipes, and infrastructure optimization. Long context scaling
and improved policy optimization methods are key ingredients of our approach,
which establishes a simple, effective RL framework without relying on more
complex techniques such as Monte Carlo tree search, value functions, and
process reward models. Notably, our system achieves state-of-the-art reasoning
performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME,
96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching
OpenAI's o1. Moreover, we present effective long2short methods that use
long-CoT techniques to improve short-CoT models, yielding state-of-the-art
short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on
LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and
Claude Sonnet 3.5 by a large margin (up to +550%).
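
The abstract's claim of an RL recipe that needs no value function or process reward model can be made concrete with a toy sketch. The following is a minimal illustration, not the paper's actual objective: it shows a REINFORCE-style loss in which the advantage is the sampled outcome reward minus the empirical mean reward over k responses to the same prompt, so no learned critic is required. The function name, tensor shapes, and PyTorch framing are assumptions for illustration.

```python
# Minimal sketch (assumed PyTorch) of a value-function-free policy gradient:
# the baseline is the empirical mean reward over k sampled responses,
# so no learned critic or process reward model is involved.
import torch

def reinforce_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """seq_logprobs: (k,) summed token log-probs of k sampled responses.
    rewards: (k,) terminal outcome rewards (e.g., 1.0 if the answer is correct)."""
    baseline = rewards.mean()                   # empirical baseline over the group
    advantages = rewards - baseline             # centered rewards, no critic
    return -(advantages * seq_logprobs).mean()  # ascend E[advantage * log-prob]

# Toy usage: 8 sampled chains-of-thought for one prompt.
logprobs = torch.randn(8, requires_grad=True)   # stand-in for model log-probs
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
reinforce_loss(logprobs, rewards).backward()
```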
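The long2short idea can likewise be illustrated with one simple candidate strategy: merging a long-CoT checkpoint into a short-CoT checkpoint by weight averaging. This is a hedged sketch of one plausible approach, not necessarily the authors' full method; `merge_state_dicts`, the checkpoint names, and the interpolation weight `alpha` are illustrative assumptions.

```python
import torch

def merge_state_dicts(long_cot_sd: dict, short_cot_sd: dict, alpha: float = 0.5) -> dict:
    """Interpolate the parameters of a long-CoT and a short-CoT checkpoint.
    alpha=0.5 is plain weight averaging; both models must share one architecture."""
    return {name: alpha * long_cot_sd[name] + (1.0 - alpha) * short_cot_sd[name]
            for name in long_cot_sd}

# Toy usage with two tiny "checkpoints" of identical shape.
a = {"w": torch.ones(2, 2)}
b = {"w": torch.zeros(2, 2)}
merged = merge_state_dicts(a, b)   # merged["w"] is 0.5 everywhere
```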