
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

January 24, 2025
作者: Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
cs.AI

Abstract

Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at https://github.com/tangzhy/RealCritic.
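The closed-loop methodology described above can be sketched as a simple loop: the critique is scored not by rating its text, but by whether the correction it induces is actually right. Below is a minimal illustrative sketch, not the paper's implementation; the function names (`closed_loop_critique_score`, `is_correct`) and prompt wording are hypothetical, and the `model` callable stands in for any LLM client. Running the loop more than once corresponds to iterative critique, and passing a different model as critic than the one that produced the initial solution corresponds to cross-critique rather than self-critique.

```python
def is_correct(answer: str, reference: str) -> bool:
    """Hypothetical checker, e.g. exact-match on a final math answer."""
    return answer.strip() == reference.strip()

def closed_loop_critique_score(model, problem: str, initial_solution: str,
                               reference: str, rounds: int = 1) -> bool:
    """Closed-loop evaluation sketch: critique a solution, apply the
    correction, and score the *corrected* solution against the reference.
    `model` is any callable prompt -> completion; `rounds > 1` models
    the iterative-critique setting."""
    solution = initial_solution
    for _ in range(rounds):
        # Step 1: ask the model to critique the current solution.
        critique = model(
            f"Problem: {problem}\nSolution: {solution}\n"
            "Critique this solution, identifying any flaws."
        )
        # Step 2: ask for a corrected solution conditioned on the critique.
        solution = model(
            f"Problem: {problem}\nSolution: {solution}\n"
            f"Critique: {critique}\nProvide a corrected final answer."
        )
    # The critique is judged by the correctness of the resulting correction.
    return is_correct(solution, reference)
```

With a real client, the same loop works unchanged; swapping the critic model for the generator model switches between self- and cross-critique.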
