
ProcessBench: Identifying Process Errors in Mathematical Reasoning

December 9, 2024
Authors: Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
cs.AI

Abstract

As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with the error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct an extensive evaluation on ProcessBench involving two types of models: process reward models (PRMs) and critic models; for the latter, we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own PRM straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, demonstrates critique capability competitive with the proprietary model GPT-4o, though it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
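To make the task format concrete, here is a minimal Python sketch of a ProcessBench-style test case and one plausible way to score predictions. The field names (`problem`, `steps`, `label`), the `-1` convention for fully correct solutions, and the harmonic-mean aggregate are assumptions drawn from the abstract's description, not necessarily the released schema or the paper's official metric.

```python
# Minimal sketch of the ProcessBench task format and scoring (illustrative;
# field names and the aggregate metric are assumptions, not the released spec).

from dataclasses import dataclass
from typing import List, Dict


@dataclass
class TestCase:
    problem: str      # the math problem statement
    steps: List[str]  # the step-by-step solution, one string per step
    label: int        # index of the earliest erroneous step; -1 if all correct


def is_correct(prediction: int, case: TestCase) -> bool:
    """A prediction counts iff it names the earliest erroneous step,
    or returns -1 when the solution contains no error."""
    return prediction == case.label


def evaluate(predictions: List[int], cases: List[TestCase]) -> Dict[str, float]:
    # Score the erroneous and fully-correct subsets separately, so a model
    # that always claims "there is an error" cannot look artificially good.
    err = [is_correct(p, c) for p, c in zip(predictions, cases) if c.label != -1]
    ok = [is_correct(p, c) for p, c in zip(predictions, cases) if c.label == -1]
    acc_err = sum(err) / len(err) if err else 0.0
    acc_ok = sum(ok) / len(ok) if ok else 0.0
    # Harmonic mean of the two subset accuracies -- one plausible aggregate.
    f1 = 2 * acc_err * acc_ok / (acc_err + acc_ok) if (acc_err + acc_ok) else 0.0
    return {"error_acc": acc_err, "correct_acc": acc_ok, "f1": f1}
```

Reporting the two subset accuracies alongside the aggregate makes the trade-off explicit: a verifier must both localize real errors and avoid flagging sound solutions.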
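The critic-model setting prompts a general-purpose language model to review each step of a solution. A hypothetical prompt template along those lines might look like the following; the wording is an illustrative stand-in, not the prompt used in the paper.

```python
# Hypothetical prompt template for the critic-model setting: a general
# language model reviews the solution step by step and reports the index of
# the earliest erroneous step, or -1 if every step is correct.

CRITIC_PROMPT = """\
The following is a math problem and a step-by-step solution.
Review the solution step by step. If you find an error, report the
0-based index of the earliest step that contains the error. If all
steps are correct, report -1.

Problem:
{problem}

Solution:
{steps}

End your response with: Answer: <index>
"""


def format_prompt(problem: str, steps: List[str]) -> str:
    # Number the steps so the model can refer to them by index.
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps))
    return CRITIC_PROMPT.format(problem=problem, steps=numbered)
```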
