CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
February 23, 2025
Authors: Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang
cs.AI
Abstract
The critique capacity of Large Language Models (LLMs) is essential to their reasoning abilities, as it provides necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of LLMs has drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) they focus on diverse reasoning tasks in general domains and evaluate code tasks insufficiently (e.g., covering only the code generation task), and their queries are relatively easy (e.g., the code queries of CriticBench are drawn from HumanEval and MBPP); (2) they lack comprehensive evaluation across different dimensions. To address these limitations, we introduce CodeCriticBench, a holistic code critique benchmark for LLMs. Specifically, CodeCriticBench covers two mainstream code tasks (i.e., code generation and code QA) at different difficulty levels. In addition, its evaluation protocols comprise basic critique evaluation and advanced critique evaluation, which target different characteristics; fine-grained evaluation checklists are carefully designed for the advanced setting. Finally, we conduct extensive experiments on existing LLMs, and the results demonstrate the effectiveness of CodeCriticBench.
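
For intuition, below is a minimal sketch of how such a two-level protocol could be scored. The sample schema (the fields query, solution, is_correct, and checklist) and both metrics are illustrative assumptions, not the paper's actual data format or implementation: basic evaluation is modeled as accuracy of the model's correct/incorrect judgments, and advanced evaluation as the mean fraction of checklist criteria a critique satisfies.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CritiqueSample:
    """One hypothetical benchmark item: a code query, a candidate solution, and gold labels."""
    query: str                     # code generation or code QA prompt
    solution: str                  # candidate code/answer to be critiqued
    is_correct: bool               # gold correctness label (basic evaluation)
    checklist: List[str] = field(default_factory=list)  # fine-grained criteria (advanced evaluation)

def basic_critique_score(predicted_correct: List[bool],
                         samples: List[CritiqueSample]) -> float:
    """Basic evaluation: accuracy of the model's correct/incorrect judgments."""
    hits = sum(p == s.is_correct for p, s in zip(predicted_correct, samples))
    return hits / len(samples)

def advanced_critique_score(checklist_hits: List[List[bool]],
                            samples: List[CritiqueSample]) -> float:
    """Advanced evaluation: mean fraction of checklist criteria each critique satisfies."""
    per_sample = [
        sum(hits) / len(s.checklist)
        for hits, s in zip(checklist_hits, samples) if s.checklist
    ]
    return sum(per_sample) / max(len(per_sample), 1)

# Toy usage (purely illustrative):
sample = CritiqueSample(
    query="Write a function that reverses a string.",
    solution="def rev(s): return s[::-1]",
    is_correct=True,
    checklist=["identifies correctness", "comments on edge cases"],
)
print(basic_critique_score([True], [sample]))               # 1.0
print(advanced_critique_score([[True, False]], [sample]))   # 0.5
```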