VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
April 7, 2025
Authors: Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, Lin Yan
cs.AI
Abstract
We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel
framework tailored for reasoning models within the value-based paradigm.
Benchmarked on the AIME 2024 dataset, VAPO, built on the
Qwen 32B pre-trained model, attains a state-of-the-art score of
60.4. In direct comparison under identical experimental settings,
VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B
and DAPO by more than 10 points. The training process of VAPO stands out for
its stability and efficiency. It reaches state-of-the-art performance within a
mere 5,000 steps. Moreover, across multiple independent runs, no training
crashes occur, underscoring its reliability. This research delves into long
chain-of-thought (long-CoT) reasoning using a value-based reinforcement
learning framework. We pinpoint three key challenges that plague value-based
methods: value model bias, the presence of heterogeneous sequence lengths, and
the sparsity of reward signals. Through systematic design, VAPO offers an
integrated solution that effectively alleviates these challenges, enabling
enhanced performance in long-CoT reasoning tasks.
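To make the value-based paradigm the abstract refers to concrete, the sketch below shows the two core ingredients of a value-based PPO update: Generalized Advantage Estimation computed from a value model, and the clipped surrogate policy loss paired with a value regression loss. This is a minimal illustration, not the VAPO implementation; the function names, hyperparameters (gamma, lam, clip_eps), and the toy data are assumptions made for the example, and VAPO's specific modifications for value bias, heterogeneous sequence lengths, and sparse rewards are not reproduced here.

```python
# Minimal sketch of a value-based PPO step (GAE + clipped surrogate loss).
# Illustrative only; hyperparameters and names are assumptions, not VAPO's.
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """GAE over one response sequence.

    rewards: per-token rewards (often zero except at the final token when
             the reward signal is sparse, as the abstract notes).
    values:  value-model predictions per token, plus one bootstrap value.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    returns = adv + values[:-1]
    return adv, returns

def ppo_losses(logp_new, logp_old, adv, v_pred, returns, clip_eps=0.2):
    """Clipped policy surrogate plus a squared-error value loss."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -np.mean(np.minimum(unclipped, clipped))
    value_loss = np.mean((v_pred - returns) ** 2)
    return policy_loss, value_loss

# Toy usage: a 5-token response with a single terminal reward.
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
values = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.0])  # last entry is bootstrap
adv, ret = gae_advantages(rewards, values)
logp_old = np.log(np.full(5, 0.20))
logp_new = np.log(np.full(5, 0.25))
pl, vl = ppo_losses(logp_new, logp_old, adv, values[:-1], ret)
print(f"policy loss {pl:.4f}, value loss {vl:.4f}")
```

In this setup the per-token advantages come entirely from the learned value model, which is why the abstract's three challenges (value model bias, heterogeneous sequence lengths, and reward sparsity) matter: a biased or poorly calibrated value model directly distorts every advantage estimate used in the policy update.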