CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
February 23, 2025
Authors: Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang
cs.AI
Abstract
The critique capacity of Large Language Models (LLMs) is essential to their reasoning abilities, as it provides necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of LLMs has drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) they focus on diverse reasoning tasks in general domains and evaluate code tasks insufficiently (e.g., covering only the code generation task), and their queries are relatively easy (e.g., the code queries of CriticBench are drawn from HumanEval and MBPP); (2) they lack comprehensive evaluation across different dimensions. To address these limitations, we introduce CodeCriticBench, a holistic code critique benchmark for LLMs. Specifically, CodeCriticBench covers two mainstream code tasks (i.e., code generation and code QA) at different difficulty levels. In addition, its evaluation protocols comprise basic critique evaluation and advanced critique evaluation, which target different characteristics; fine-grained evaluation checklists are carefully designed for the advanced setting. Finally, we conduct extensive experiments on existing LLMs, and the results demonstrate the effectiveness of CodeCriticBench.
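
For intuition, below is a minimal sketch of how such a two-level protocol could be scored. The sample schema (the fields query, solution, is_correct, and checklist) and both metrics are illustrative assumptions, not the paper's actual data format or implementation: basic evaluation is modeled as accuracy of the model's correct/incorrect judgments, and advanced evaluation as the mean fraction of checklist criteria a critique satisfies.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CritiqueSample:
    """One hypothetical benchmark item: a code query, a candidate solution, and gold labels."""
    query: str                     # code generation or code QA prompt
    solution: str                  # candidate code/answer to be critiqued
    is_correct: bool               # gold correctness label (basic evaluation)
    checklist: List[str] = field(default_factory=list)  # fine-grained criteria (advanced evaluation)

def basic_critique_score(predicted_correct: List[bool],
                         samples: List[CritiqueSample]) -> float:
    """Basic evaluation: accuracy of the model's correct/incorrect judgments."""
    hits = sum(p == s.is_correct for p, s in zip(predicted_correct, samples))
    return hits / len(samples)

def advanced_critique_score(checklist_hits: List[List[bool]],
                            samples: List[CritiqueSample]) -> float:
    """Advanced evaluation: mean fraction of checklist criteria each critique satisfies."""
    per_sample = [
        sum(hits) / len(s.checklist)
        for hits, s in zip(checklist_hits, samples) if s.checklist
    ]
    return sum(per_sample) / max(len(per_sample), 1)

# Toy usage (purely illustrative):
sample = CritiqueSample(
    query="Write a function that reverses a string.",
    solution="def rev(s): return s[::-1]",
    is_correct=True,
    checklist=["identifies correctness", "comments on edge cases"],
)
print(basic_critique_score([True], [sample]))               # 1.0
print(advanced_critique_score([[True, False]], [sample]))   # 0.5
```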