Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

March 10, 2025
Authors: Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, Aviral Kumar
cs.AI

Abstract

Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget grows? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective enables us to view the long output stream from the LLM as consisting of several episodes run at test time, and leads us to use a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms can best trade off exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with the 0/1 outcome reward in RL. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning (MRT), a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency on math reasoning compared to outcome-reward RL.
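
To make the dense "progress" bonus concrete, here is a minimal sketch (not the authors' implementation) of one way it could be estimated. It assumes the probability of eventual success from each prefix of the output stream is approximated with Monte Carlo rollouts; the `sample_completion` and `is_correct` helpers, the `prompt`/`blocks` inputs, and the rollout-based estimator itself are hypothetical placeholders introduced only for illustration.

```python
# Hypothetical sketch of the dense "progress" bonus described in the abstract.
# Assumptions (not from the paper's code): `sample_completion(prefix)` draws one
# rollout from the model given a text prefix, and `is_correct(completion)`
# returns True if the rollout ends in a correct final answer (0/1 outcome).

def estimate_success_prob(sample_completion, is_correct, prompt, blocks, n_rollouts=8):
    """Monte Carlo estimate of the probability of eventually reaching a
    correct answer when continuing from `prompt` plus the given blocks."""
    prefix = prompt + "".join(blocks)
    hits = sum(is_correct(sample_completion(prefix)) for _ in range(n_rollouts))
    return hits / n_rollouts

def progress_bonuses(sample_completion, is_correct, prompt, blocks, n_rollouts=8):
    """Dense reward bonus for each block: the change in estimated success
    probability after appending that block to the output stream."""
    bonuses = []
    prev_p = estimate_success_prob(sample_completion, is_correct, prompt, [], n_rollouts)
    for j in range(1, len(blocks) + 1):
        p = estimate_success_prob(sample_completion, is_correct, prompt, blocks[:j], n_rollouts)
        bonuses.append(p - prev_p)  # positive if the block made progress
        prev_p = p
    return bonuses
```

In training, a bonus of this kind would be combined with the 0/1 outcome reward, so that each block of test-time computation is rewarded for measurably increasing the chance of eventual success rather than merely consuming tokens.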
