FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
March 9, 2025
Authors: Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, Scarlett Li
cs.AI
Abstract
Implementing new features in repository-level codebases is a crucial
application of code generation models. However, current benchmarks lack a
dedicated evaluation framework for this capability. To fill this gap, we
introduce FEA-Bench, a benchmark designed to assess the ability of large
language models (LLMs) to perform incremental development within code
repositories. We collect pull requests from 83 GitHub repositories and use
rule-based and intent-based filtering to construct task instances focused on
new feature development. Each task instance containing code changes is paired
with relevant unit test files to ensure that the solution can be verified. The
feature implementation requires LLMs to simultaneously possess code completion
capabilities for new components and code editing abilities for other relevant
parts of the code repository, providing a more comprehensive evaluation of
LLMs' automated software engineering capabilities. Experimental results show
that LLMs perform significantly worse on FEA-Bench, highlighting
considerable challenges in such repository-level incremental code development.
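The abstract describes task instances built from pull requests and verified against paired unit tests, but only at a high level. The following is a minimal sketch of what such an instance and its verification step might look like, assuming a SWE-bench-style setup; the field names (`repo`, `base_commit`, `test_files`, etc.) and the patch-then-test flow are illustrative assumptions, not the official FEA-Bench schema or evaluation harness.

```python
# Minimal sketch, not the official FEA-Bench schema or harness: field names and
# the patch-then-test flow below are assumptions for illustration.
import subprocess
import tempfile
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskInstance:
    """A hypothetical FEA-Bench-style task instance derived from a GitHub pull request."""
    repo: str                  # e.g. "owner/project"
    base_commit: str           # commit the model's patch is applied on top of
    problem_statement: str     # natural-language description of the new feature
    new_components: List[str]  # signatures/stubs of new components to complete
    test_files: List[str] = field(default_factory=list)  # unit tests that verify the feature


def verify(instance: TaskInstance, model_patch: str, repo_dir: str) -> bool:
    """Apply a model-generated patch to the repository and run the paired unit tests."""
    # Check out the base commit so the patch is applied to the expected repository state.
    subprocess.run(["git", "checkout", instance.base_commit], cwd=repo_dir, check=True)

    # The patch must both complete new components and edit other relevant parts of the repo.
    with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as fh:
        fh.write(model_patch)
        patch_path = fh.name
    if subprocess.run(["git", "apply", patch_path], cwd=repo_dir).returncode != 0:
        return False  # patch does not apply cleanly

    # The instance counts as resolved only if its associated unit tests pass.
    result = subprocess.run(["python", "-m", "pytest", *instance.test_files], cwd=repo_dir)
    return result.returncode == 0
```

Under these assumptions, an instance is resolved only when the patch applies cleanly and every paired unit test passes, which matches the verifiability requirement stated in the abstract.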