AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
February 19, 2025
Authors: Yuliang Liu, Junjie Lu, Zhaoling Chen, Chaofeng Qu, Jason Klein Liu, Chonghan Liu, Zefan Cai, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin
cs.AI
Abstract
Current approaches for training Process Reward Models (PRMs) often involve
breaking down responses into multiple reasoning steps using rule-based
techniques, such as using predefined placeholder tokens or fixing the
reasoning step length at a specific size. These approaches overlook the fact
that specific words do not typically mark true decision points in a text. To
address this, we propose AdaptiveStep, a method that divides reasoning steps
based on the model's confidence in predicting the next word. This division
method provides more decision-making information at each step, enhancing
downstream tasks, such as reward model learning. Moreover, our method does not
require manual annotation. We demonstrate its effectiveness through experiments
with AdaptiveStep-trained PRMs in mathematical reasoning and code generation
tasks. Experimental results indicate that the resulting PRM achieves
state-of-the-art Best-of-N performance, surpassing the greedy search strategy
with token-level value-guided decoding, while also reducing construction costs by
over 30% compared to existing open-source PRMs. In addition, we provide a
thorough analysis and case study on the PRM's performance, transferability, and
generalization capabilities.
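To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of confidence-based step division: each generated token is paired with the model's confidence in predicting it (e.g. the top-1 next-token probability), and a new reasoning step begins whenever that confidence falls below a threshold. The threshold value and the toy token stream are illustrative assumptions, not values from the paper.

```python
# Sketch of dividing a generated response into reasoning steps by
# next-token prediction confidence (assumed setup, for illustration only).
from typing import List, Tuple


def split_by_confidence(
    tokens_with_conf: List[Tuple[str, float]],
    threshold: float = 0.85,  # assumed cutoff; in practice it would be tuned
) -> List[str]:
    """Group tokens into steps, starting a new step before each low-confidence token."""
    steps: List[List[str]] = [[]]
    for token, conf in tokens_with_conf:
        # A low-confidence prediction is treated as a decision point.
        if conf < threshold and steps[-1]:
            steps.append([])
        steps[-1].append(token)
    return ["".join(step) for step in steps]


if __name__ == "__main__":
    # Toy generation with made-up confidences.
    generation = [
        ("First,", 0.97), (" add", 0.95), (" 3", 0.99), (" and", 0.98),
        (" 4", 0.99), (".", 0.96),
        (" Then", 0.62),  # low confidence -> a new step starts here
        (" multiply", 0.91), (" by", 0.97), (" 2", 0.99), (".", 0.95),
    ]
    for i, step in enumerate(split_by_confidence(generation), 1):
        print(f"Step {i}:{step}")
```

Under this sketch, step boundaries fall at genuinely uncertain tokens rather than at fixed lengths or placeholder tokens, which is the property the abstract attributes to AdaptiveStep.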