VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
April 7, 2025
Authors: Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, Lin Yan
cs.AI
Abstract
We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel
framework tailored for reasoning models within the value-based paradigm.
Benchmarked on the AIME 2024 dataset, VAPO, built on the
Qwen 32B pre-trained model, attains a state-of-the-art score of
60.4. In direct comparison under identical experimental settings,
VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B
and DAPO by more than 10 points. The training process of VAPO stands out for
its stability and efficiency. It reaches state-of-the-art performance within a
mere 5,000 steps. Moreover, across multiple independent runs, no training
crashes occur, underscoring its reliability. This research delves into long
chain-of-thought (long-CoT) reasoning using a value-based reinforcement
learning framework. We pinpoint three key challenges that plague value-based
methods: value model bias, the presence of heterogeneous sequence lengths, and
the sparsity of reward signals. Through systematic design, VAPO offers an
integrated solution that effectively alleviates these challenges, enabling
enhanced performance in long-CoT reasoning tasks.
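To make the value-based paradigm the abstract refers to concrete, the sketch below shows the two core ingredients of a value-based PPO update: Generalized Advantage Estimation computed from a value model, and the clipped surrogate policy loss paired with a value regression loss. This is a minimal illustration, not the VAPO implementation; the function names, hyperparameters (gamma, lam, clip_eps), and the toy data are assumptions made for the example, and VAPO's specific modifications for value bias, heterogeneous sequence lengths, and sparse rewards are not reproduced here.

```python
# Minimal sketch of a value-based PPO step (GAE + clipped surrogate loss).
# Illustrative only; hyperparameters and names are assumptions, not VAPO's.
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """GAE over one response sequence.

    rewards: per-token rewards (often zero except at the final token when
             the reward signal is sparse, as the abstract notes).
    values:  value-model predictions per token, plus one bootstrap value.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    returns = adv + values[:-1]
    return adv, returns

def ppo_losses(logp_new, logp_old, adv, v_pred, returns, clip_eps=0.2):
    """Clipped policy surrogate plus a squared-error value loss."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -np.mean(np.minimum(unclipped, clipped))
    value_loss = np.mean((v_pred - returns) ** 2)
    return policy_loss, value_loss

# Toy usage: a 5-token response with a single terminal reward.
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
values = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.0])  # last entry is bootstrap
adv, ret = gae_advantages(rewards, values)
logp_old = np.log(np.full(5, 0.20))
logp_new = np.log(np.full(5, 0.25))
pl, vl = ppo_losses(logp_new, logp_old, adv, values[:-1], ret)
print(f"policy loss {pl:.4f}, value loss {vl:.4f}")
```

In this setup the per-token advantages come entirely from the learned value model, which is why the abstract's three challenges (value model bias, heterogeneous sequence lengths, and reward sparsity) matter: a biased or poorly calibrated value model directly distorts every advantage estimate used in the policy update.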