Free Process Rewards without Process Labels
December 2, 2024
Authors: Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng
cs.AI
Abstract
Unlike its counterpart, the outcome reward model (ORM), which evaluates the
entire response, a process reward model (PRM) scores a reasoning
trajectory step by step, providing denser and more fine-grained rewards.
However, training a PRM requires labels annotated at every intermediate step,
presenting significant challenges for both manual and automatic data
collection. This paper aims to address this challenge. Both theoretically and
empirically, we show that an implicit PRM can be obtained at no
additional cost, by simply training an ORM on the cheaper response-level
labels. The only assumption is to parameterize the outcome reward as the
log-likelihood ratios of the policy and reference models, which can be
optimized regardless of the specific choice of loss objectives. In experiments,
we instantiate our implicit PRMs with various objectives and evaluate their
performance on MATH. We show that our implicit PRM outperforms a strong
MCTS-based baseline à la Math-Shepherd using less than 1/38 of the
training data. Its performance can be further improved with majority voting. We
further find that scaling up instructions and responses benefits our implicit
PRM, and the latter brings a larger gain. Particularly, we find that our
implicit PRM, when instantiated with the cross-entropy (CE) loss, is more
data-efficient and can keep improving generation models even when trained with
only one response per instruction, the setup that suffers from extreme data
scarcity and imbalance. Further, instructions should be relevant to downstream
tasks while the diversity of responses does not bring gains. Surprisingly,
training on extra Math-Shepherd step labels brings no further improvements to
our implicit PRM trained on only outcome data. We hope that our work will
encourage a rethinking of PRM training approaches and contribute to making
training PRMs more accessible.
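The abstract's central idea, parameterizing the outcome reward as the log-likelihood ratio of the policy and reference models so that per-step rewards come for free, can be sketched in a few lines. The snippet below is a minimal illustration based only on that description, not the authors' released implementation; the function name, the scaling factor `beta`, and the `step_ends` step boundaries are illustrative assumptions.

```python
# Minimal sketch of the implicit-PRM idea: if an ORM is trained with the
# outcome reward parameterized as r(y) = beta * log(pi_theta(y|x) / pi_ref(y|x)),
# then a process reward for each step can be read off as the difference of
# cumulative log-ratios, with no step-level labels needed.

import torch

def implicit_process_rewards(policy_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             step_ends: list[int],
                             beta: float = 0.05) -> torch.Tensor:
    """policy_logprobs, ref_logprobs: per-token log-probs of one response, shape (T,).
    step_ends: index of the last token of each reasoning step.
    Returns one implicit reward per step."""
    # Cumulative log-likelihood ratio of policy vs. reference up to each token.
    cum_ratio = torch.cumsum(policy_logprobs - ref_logprobs, dim=0)
    # q_t: scaled cumulative ratio at the end of step t.
    q = beta * cum_ratio[torch.tensor(step_ends)]
    # Process reward of step t is q_t - q_{t-1}, with q_0 defined as 0.
    prev = torch.cat([torch.zeros(1), q[:-1]])
    return q - prev

if __name__ == "__main__":
    T = 12
    policy_lp = torch.randn(T).abs().neg()  # stand-in per-token log-probs
    ref_lp = torch.randn(T).abs().neg()
    print(implicit_process_rewards(policy_lp, ref_lp, step_ends=[3, 7, 11]))
```

This only assumes access to token log-probabilities from the trained ORM (as the policy) and its reference model, which is why the paper describes the resulting PRM as obtainable at no additional cost beyond response-level training.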