

Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol

March 7, 2025
作者: Roham Koohestani, Philippe de Bekker, Maliheh Izadi
cs.AI

Abstract

Benchmarks are essential for consistent evaluation and reproducibility. The integration of Artificial Intelligence into Software Engineering (AI4SE) has given rise to numerous benchmarks for tasks such as code generation and bug fixing. However, this surge presents challenges: (1) scattered benchmark knowledge across tasks, (2) difficulty in selecting relevant benchmarks, (3) the absence of a uniform standard for benchmark development, and (4) limitations of existing benchmarks. In this paper, we review 173 studies and identify 204 AI4SE benchmarks. We classify these benchmarks, analyze their limitations, and expose gaps in current practices. Based on our review, we created BenchScout, a semantic search tool for finding relevant benchmarks, built on automated clustering of the contexts from associated studies. We conducted a user study with 22 participants to evaluate BenchScout's usability, effectiveness, and intuitiveness, which received average scores of 4.5, 4.0, and 4.1 out of 5, respectively. To advance benchmarking standards, we propose BenchFrame, a unified method for enhancing benchmark quality. As a case study, we applied BenchFrame to the HumanEval benchmark and addressed its main limitations. This led to HumanEvalNext, featuring (1) corrected errors, (2) improved language conversion, (3) expanded test coverage, and (4) increased difficulty. We then evaluated ten state-of-the-art code language models on HumanEval, HumanEvalPlus, and HumanEvalNext. On HumanEvalNext, models showed a pass@1 score reduction of 31.22% and 19.94% compared to HumanEval and HumanEvalPlus, respectively.
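The reported pass@1 numbers refer to the standard functional-correctness metric used with HumanEval-style benchmarks. As a reference point only (the paper's own evaluation harness is not described here), below is a minimal Python sketch of the unbiased pass@k estimator from the original HumanEval work, specialized to k=1; the sample counts in the usage example are hypothetical.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n generations of which c pass all tests,
    is functionally correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a product for numerical stability
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Hypothetical example: a model generates n=20 samples per problem and
# c of them pass the benchmark's tests; pass@1 is averaged over problems.
per_problem_correct = [20, 3, 0, 11]
scores = [pass_at_k(20, c, 1) for c in per_problem_correct]
print(sum(scores) / len(scores))  # mean pass@1 over the problems
```

For k=1 this reduces to c/n per problem, so a drop in pass@1 on HumanEvalNext directly reflects a smaller fraction of generations passing the stricter tests.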
