PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

January 6, 2025
Authors: Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng
cs.AI

Abstract

Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs' performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can be a robust benchmark for advancing research on PRM evaluation and development.
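The abstract describes evaluating PRMs against step-level error labels. A minimal sketch of what such a step-level comparison could look like is below; the per-step 0/1 label format and the `step_f1` helper are illustrative assumptions, not the benchmark's actual schema or metric.

```python
# Hedged sketch: scoring a PRM's step-level predictions against gold labels.
# The record format (one 0/1 label per reasoning step, 1 = erroneous step)
# is a hypothetical simplification, not PRMBench's actual data schema.

def step_f1(gold, pred):
    """Precision, recall, and F1 for detecting erroneous steps.

    gold, pred: lists of 0/1 per step, where 1 marks an erroneous step.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Example: a 5-step solution where steps 2 and 4 carry gold error labels,
# and the PRM additionally (wrongly) flags step 3.
gold = [0, 1, 0, 1, 0]
pred = [0, 1, 1, 1, 0]
p, r, f1 = step_f1(gold, pred)  # high recall, lower precision
```

A metric like this rewards both catching every injected error and not over-flagging correct steps, which is the kind of fine-grained sensitivity the benchmark is said to probe.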

