AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
February 19, 2025
Authors: Yuliang Liu, Junjie Lu, Zhaoling Chen, Chaofeng Qu, Jason Klein Liu, Chonghan Liu, Zefan Cai, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin
cs.AI
Abstract
Current approaches for training Process Reward Models (PRMs) often involve
breaking down responses into multiple reasoning steps using rule-based
techniques, such as using predefined placeholder tokens or fixing the
reasoning step length at a specific size. These approaches overlook the fact
that specific words do not typically mark true decision points in a text. To
address this, we propose AdaptiveStep, a method that divides reasoning steps
based on the model's confidence in predicting the next word. This division
method provides more decision-making information at each step, enhancing
downstream tasks, such as reward model learning. Moreover, our method does not
require manual annotation. We demonstrate its effectiveness through experiments
with AdaptiveStep-trained PRMs in mathematical reasoning and code generation
tasks. Experimental results indicate that the resulting PRM achieves
state-of-the-art Best-of-N performance, surpassing the greedy search strategy
with token-level value-guided decoding, while also reducing construction costs by
over 30% compared to existing open-source PRMs. In addition, we provide a
thorough analysis and case study on the PRM's performance, transferability, and
generalization capabilities.
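To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of confidence-based step division: each generated token is paired with the model's confidence in predicting it (e.g. the top-1 next-token probability), and a new reasoning step begins whenever that confidence falls below a threshold. The threshold value and the toy token stream are illustrative assumptions, not values from the paper.

```python
# Sketch of dividing a generated response into reasoning steps by
# next-token prediction confidence (assumed setup, for illustration only).
from typing import List, Tuple


def split_by_confidence(
    tokens_with_conf: List[Tuple[str, float]],
    threshold: float = 0.85,  # assumed cutoff; in practice it would be tuned
) -> List[str]:
    """Group tokens into steps, starting a new step before each low-confidence token."""
    steps: List[List[str]] = [[]]
    for token, conf in tokens_with_conf:
        # A low-confidence prediction is treated as a decision point.
        if conf < threshold and steps[-1]:
            steps.append([])
        steps[-1].append(token)
    return ["".join(step) for step in steps]


if __name__ == "__main__":
    # Toy generation with made-up confidences.
    generation = [
        ("First,", 0.97), (" add", 0.95), (" 3", 0.99), (" and", 0.98),
        (" 4", 0.99), (".", 0.96),
        (" Then", 0.62),  # low confidence -> a new step starts here
        (" multiply", 0.91), (" by", 0.97), (" 2", 0.99), (".", 0.95),
    ]
    for i, step in enumerate(split_by_confidence(generation), 1):
        print(f"Step {i}:{step}")
```

Under this sketch, step boundaries fall at genuinely uncertain tokens rather than at fixed lengths or placeholder tokens, which is the property the abstract attributes to AdaptiveStep.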