

Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?

February 26, 2025
作者: Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
cs.AI

Abstract

Recently, o1-like models have drawn significant attention for producing long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the quality of these long CoTs and to measure the critique abilities of existing LLMs on them, we introduce DeltaBench, which includes long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform a fine-grained analysis of the generated long CoTs to assess the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models on detecting errors in each annotated section, aiming to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench can guide developers to better understand the long CoT reasoning abilities of their models.
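To make the error-detection task concrete, one common way to score a critic model against human annotations is step-level precision/recall/F1 over the CoT steps flagged as erroneous. The sketch below is a hypothetical illustration of that scoring scheme; the exact metric and annotation format used by DeltaBench are assumptions here, not taken from the paper.

```python
# Hypothetical sketch: scoring a critic model's error detection on a long CoT.
# A prediction and an annotation are each a collection of step indices judged erroneous.
# (The actual DeltaBench metric/format may differ; this is only an illustration.)

def step_f1(predicted_error_steps, annotated_error_steps):
    """Precision, recall, and F1 over the indices of CoT steps flagged as erroneous."""
    pred, gold = set(predicted_error_steps), set(annotated_error_steps)
    if not pred and not gold:
        # Critic flagged nothing and annotators marked nothing: perfect agreement.
        return 1.0, 1.0, 1.0
    tp = len(pred & gold)                      # correctly flagged steps
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: the critic flags steps 3 and 7; annotators marked steps 3 and 5.
p, r, f = step_f1([3, 7], [3, 5])  # → (0.5, 0.5, 0.5)
```

Set-based scoring like this rewards a critic for localizing every erroneous step, not merely for judging the whole chain right or wrong.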

