M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation
October 28, 2024
Authors: Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin, Ge Zhang, Zekun Wang, Guoan Zhang, Bangyu Xiang, Wenbo Su, Bo Zheng
cs.AI
Abstract
Repository-level code completion has drawn great attention in software
engineering, and several benchmark datasets have been introduced. However,
existing repository-level code completion benchmarks usually focus on a limited
number of languages (<5), which cannot evaluate the general code intelligence
abilities across different languages for existing code Large Language Models
(LLMs). In addition, existing benchmarks usually report only overall average
scores across languages, ignoring the fine-grained abilities in different
completion scenarios. Therefore, to facilitate research on code LLMs in
multilingual scenarios, we propose M2RC-EVAL, a massively multilingual
repository-level code completion benchmark covering 18 programming languages,
with two types of fine-grained annotations (i.e., bucket-level and
semantic-level) for different completion scenarios, which we obtain from the
parsed abstract syntax trees. Moreover, we also curate M2RC-INSTRUCT, a
massively multilingual instruction corpus, to improve the repository-level
code completion abilities of existing code LLMs. Comprehensive experimental
results demonstrate the effectiveness of our M2RC-EVAL and M2RC-INSTRUCT.
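To illustrate the idea of deriving fine-grained annotations from a parsed abstract syntax tree, the sketch below tags each node with a semantic label (its node type) and a bucket index derived from its depth in the tree. This is a toy illustration only: the paper's actual annotation pipeline, parser, and label sets are its own, and Python's built-in `ast` module merely stands in for whichever multilingual parser the benchmark uses.

```python
import ast

def annotate_spans(source: str, n_buckets: int = 10):
    """Toy annotator: give every located AST node a semantic-level tag
    (its node-type name) and a bucket-level tag (its normalized depth)."""
    tree = ast.parse(source)

    # First pass: find the maximum depth so bucket indices can be normalized.
    def max_depth(node, d=0):
        children = list(ast.iter_child_nodes(node))
        return d if not children else max(max_depth(c, d + 1) for c in children)

    deepest = max(max_depth(tree), 1)

    annotations = []

    def walk(node, depth=0):
        if hasattr(node, "lineno"):  # skip synthetic nodes without a location
            bucket = min(depth * n_buckets // (deepest + 1), n_buckets - 1)
            annotations.append({
                "line": node.lineno,
                "semantic": type(node).__name__,  # semantic-level tag
                "bucket": bucket,                 # bucket-level tag
            })
        for child in ast.iter_child_nodes(node):
            walk(child, depth + 1)

    walk(tree)
    return annotations

for a in annotate_spans("def add(a, b):\n    return a + b\n"):
    print(a)
```

Running this on the two-line `add` function yields one annotation per located node (e.g. a `FunctionDef` in a shallow bucket and the nested `BinOp`/`Name` nodes in deeper ones), mirroring in miniature how completion cursors can be classified by both what kind of node they fall in and how deep in the tree they sit.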