ProcessBench: Identifying Process Errors in Mathematical Reasoning

December 9, 2024
Authors: Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
cs.AI

Abstract

As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly important for scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with its error location annotated by human experts. Models are required to identify the earliest step that contains an error, or to conclude that all steps are correct. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own PRM straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, demonstrates critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
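To make the task format concrete, below is a minimal Python sketch of what a ProcessBench-style test case and its scoring could look like. The field names (steps, label), the use of -1 to mark fully correct solutions, and the harmonic-mean aggregation are illustrative assumptions inferred from the abstract, not the paper's exact schema or metric.

    # Minimal sketch of a ProcessBench-style test case and evaluation loop.
    from statistics import harmonic_mean

    # Each test case pairs a step-by-step solution with the human-annotated
    # index of the earliest erroneous step; -1 means every step is correct.
    # (Hypothetical schema for illustration.)
    test_cases = [
        {"steps": ["Let x = 3.", "Then 2x = 5.", "So x + 2x = 8."], "label": 1},
        {"steps": ["a + b = 10.", "a - b = 4.", "So a = 7, b = 3."], "label": -1},
    ]

    def evaluate(predictions, cases):
        """Accuracy on erroneous solutions and on fully correct solutions,
        combined with a harmonic mean (one plausible aggregation; the paper
        may report a different metric)."""
        err_hits = err_total = ok_hits = ok_total = 0
        for pred, case in zip(predictions, cases):
            if case["label"] == -1:
                ok_total += 1
                ok_hits += pred == -1
            else:
                err_total += 1
                err_hits += pred == case["label"]
        acc_err = err_hits / err_total if err_total else 0.0
        acc_ok = ok_hits / ok_total if ok_total else 0.0
        # harmonic_mean returns 0.0 if either accuracy is zero.
        return harmonic_mean([acc_err, acc_ok])

    # A model answers with the earliest erroneous step index, or -1.
    print(evaluate([1, -1], test_cases))  # perfect predictions -> 1.0

This also illustrates why both solution types matter in the benchmark: a model that flags an error in every solution scores well on erroneous cases but zero on correct ones, and the harmonic mean penalizes that imbalance.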
