VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Abstract

We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of 60.4. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency: it reaches state-of-the-art performance within only 5,000 steps. Moreover, across multiple independent runs, no training crashes occurred, underscoring its reliability. This research examines long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, heterogeneous sequence lengths, and sparse reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, improving performance on long-CoT reasoning tasks.
