Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'

Abstract

Large language models (LLMs) have shown remarkable code generation ability, achieving more than 90% pass@1 on Python coding problems in HumanEval and MBPP. Such high accuracy raises the question: can LLMs replace human programmers? Existing benchmarks, which consist of manually crafted, simple, or single-line code generation problems, cannot answer this question because of their gap with real-world software development. To answer it, we propose REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. In our evaluation of ten LLMs, no model achieves more than 30% pass@1 on REPOCOD, which shows the need for stronger LLMs that can help developers in real-world software development.
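
The abstract reports results as pass@1 but does not say how the metric is computed. As a rough illustration, the sketch below uses the unbiased pass@k estimator popularized alongside HumanEval; that REPOCOD uses this exact estimator is an assumption, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples generated, c of them pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Mean pass@k in percent over (n, c) pairs, one pair per problem (assumed input shape)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)
```

With a single sample per problem, pass@1 reduces to the percentage of problems whose generated solution passes every test. Under that reading, a score of at most 30 on 980 problems corresponds to at most 294 solved problems.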

