Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'

Abstract

Large language models (LLMs) have shown remarkable ability in code generation, achieving more than 90% pass@1 on Python coding problems in HumanEval and MBPP. Such high accuracy raises the question: can LLMs replace human programmers? Existing manually crafted, simple, or single-line code generation benchmarks cannot answer this question because of their gap with real-world software development. To answer it, we propose REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. In our evaluation of ten LLMs, no model achieves more than 30% pass@1 on REPOCOD, which reveals the need to build stronger LLMs that can help developers in real-world software development.

