PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
January 6, 2025
Authors: Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng
cs.AI
Abstract
Process-level Reward Models (PRMs) are crucial for complex reasoning and
decision-making tasks, where each intermediate step plays an important role in
the reasoning process. Since language models are prone to various types of
errors during the reasoning process, PRMs are required to possess nuanced
capabilities for detecting various implicit error types in real-world
scenarios. However, current benchmarks primarily focus on step correctness,
failing to evaluate PRMs' performance systematically. To address this gap, we
introduce PRMBench, a process-level benchmark specifically designed to assess
the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216
carefully designed problems and 83,456 step-level labels, evaluating models
across multiple dimensions, including simplicity, soundness, and sensitivity.
In our experiments on 15 models, spanning both open-source PRMs and
closed-source large language models prompted as critic models, we uncover
significant weaknesses in current PRMs. These findings underscore the
challenges inherent in process-level evaluation and highlight key directions
for future research. We hope PRMBench can serve as a robust benchmark for
advancing research on PRM evaluation and development.
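The step-level evaluation the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration of scoring a PRM's error detection against gold step labels; the function name, label convention, and choice of precision/recall/F1 are assumptions for exposition, not PRMBench's actual protocol.

```python
# Hypothetical sketch: scoring a PRM's step-level error detection.
# A PRM flags each intermediate reasoning step as erroneous (1) or
# correct (0); we compare its flags against gold step-level labels.

def step_f1(gold, pred):
    """Precision/recall/F1 for detecting erroneous steps.

    gold, pred: equal-length lists of 0/1 per step (1 = erroneous).
    """
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: a 5-step solution with an error at step 3; the PRM flags
# steps 3 and 5, so it catches the error but raises one false alarm.
gold = [0, 0, 1, 0, 0]
pred = [0, 0, 1, 0, 1]
p, r, f1 = step_f1(gold, pred)  # p = 0.5, r = 1.0
```

In practice a benchmark of this kind aggregates such per-step scores over many problems, which is why fine-grained step labels (83,456 of them here) matter more than a single per-solution correctness bit.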