Free Process Rewards without Process Labels
December 2, 2024
Authors: Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng
cs.AI
Abstract
Different from its counterpart, the outcome reward model (ORM), which
evaluates entire responses, a process reward model (PRM) scores a reasoning
trajectory step by step, providing denser and more fine-grained rewards.
However, training a PRM requires labels annotated at every intermediate step,
presenting significant challenges for both manual and automatic data
collection. This paper aims to address this challenge. Both theoretically and
empirically, we show that an implicit PRM can be obtained at no
additional cost, by simply training an ORM on the cheaper response-level
labels. The only assumption is to parameterize the outcome reward as the
log-likelihood ratios of the policy and reference models, which can be
optimized regardless of the specific choice of loss objectives. In experiments,
we instantiate our implicit PRMs with various objectives and evaluate their
performance on MATH. We show that our implicit PRM outperforms a strong
MCTS-based baseline à la Math-Shepherd using less than 1/38 of the
training data. Its performance can be further improved with majority voting. We
further find that scaling up instructions and responses benefits our implicit
PRM, and the latter brings a larger gain. Particularly, we find that our
implicit PRM, when instantiated with the cross-entropy (CE) loss, is more
data-efficient and can keep improving generation models even when trained with
only one response per instruction, a setup that suffers from extreme data
scarcity and imbalance. Further, instructions should be relevant to downstream
tasks, while the diversity of responses does not bring gains. Surprisingly,
training on extra Math-Shepherd step labels brings no further improvements to
our implicit PRM trained on only outcome data. We hope that our work will
encourage a rethinking of PRM training approaches and contribute to making
training PRMs more accessible.
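The abstract's key assumption, parameterizing the outcome reward as the log-likelihood ratio between the policy and a reference model, implies that step-level rewards can be read off directly from token log-probabilities, with no step labels involved. Below is a minimal sketch of that computation under stated assumptions; the function name, the `step_ends` convention for marking step boundaries, and the `beta` scale factor are illustrative choices, not the authors' released implementation.

```python
import numpy as np

def implicit_process_rewards(policy_logps, ref_logps, step_ends, beta=1.0):
    """Minimal sketch: recover step-level rewards from an implicit PRM.

    policy_logps / ref_logps: per-token log-probabilities of one response
    under the trained model (policy) and the frozen reference model.
    step_ends: index of the last token of each reasoning step.
    Returns q (cumulative reward up to each step boundary) and
    r (per-step reward, q_t - q_{t-1}).
    """
    log_ratio = np.asarray(policy_logps) - np.asarray(ref_logps)
    cumulative = np.cumsum(log_ratio)             # running log-likelihood ratio
    q = beta * cumulative[np.asarray(step_ends)]  # reward accumulated up to each boundary
    r = np.diff(np.concatenate(([0.0], q)))       # differences give per-step rewards
    return q, r

# Hypothetical usage: three steps ending at tokens 4, 9, and 14 of a 15-token response.
policy_lp = np.random.randn(15) * 0.1 - 1.0
ref_lp = np.random.randn(15) * 0.1 - 1.0
q, r = implicit_process_rewards(policy_lp, ref_lp, step_ends=[4, 9, 14])
```

The point of the sketch is only that the model is trained with response-level labels, yet per-step scores fall out of the parameterization for free, which is what makes the PRM "implicit."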