Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'

Abstract

Large language models (LLMs) have shown remarkable ability in code generation, with pass@1 above 90% on Python coding problems in HumanEval and MBPP. Such high accuracy raises the question: can LLMs replace human programmers? Existing code generation benchmarks, which are manually crafted, simple, or limited to single-line generation, cannot answer this question because of the gap between them and real-world software development. To answer it, we propose REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. In our evaluation of ten LLMs, none of the models achieves more than 30% pass@1 on REPOCOD, revealing the need to build stronger LLMs that can help developers in real-world software development.
