
Process-based Self-Rewarding Language Models

March 5, 2025
Authors: Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong
cs.AI

Abstract

Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. To further improve LLMs' performance, human-annotated preference data is used for training, but this approach is constrained by the upper limit of human capabilities. Therefore, the Self-Rewarding method has been proposed, in which LLMs generate their own training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization into the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of self-rewarding methods to achieve LLM reasoning that may surpass human capabilities.
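
The abstract names three components of the pipeline: long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization. As a rough illustration only, the sketch below shows how step-level preference pairs might be collected in such a loop. Every helper callable here (sample_steps, judge_step, is_final) is a hypothetical placeholder rather than the authors' implementation; the actual pipeline details are in the paper.

```python
from typing import Callable, List, Tuple

Step = str
# (prompt, solution prefix, chosen step, rejected step)
PreferencePair = Tuple[str, List[Step], Step, Step]

def collect_step_preferences(
    prompts: List[str],
    sample_steps: Callable[[str, List[Step]], List[Step]],  # propose candidate next steps
    judge_step: Callable[[str, List[Step], Step], float],   # step-wise LLM-as-a-Judge score
    is_final: Callable[[List[Step]], bool],                  # True when the solution is complete
) -> List[PreferencePair]:
    """Collect step-level preference pairs from the model's own judgments (illustrative sketch)."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        prefix: List[Step] = []
        while not is_final(prefix):
            candidates = sample_steps(prompt, prefix)
            if not candidates:
                break
            # Rank candidate steps by the model's own judge score (step-wise LLM-as-a-Judge).
            ranked = sorted(candidates, key=lambda step: judge_step(prompt, prefix, step))
            chosen, rejected = ranked[-1], ranked[0]
            pairs.append((prompt, list(prefix), chosen, rejected))
            prefix.append(chosen)  # extend the long-thought reasoning trace
    return pairs
```

In this reading, the collected pairs would then feed a step-wise preference optimization update (for example, a step-level DPO-style objective), after which the improved model generates and judges new data in the next iteration.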

