SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors that predict the output and behavior of a program without actually running it. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. Additionally, we categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. The code and dataset are released at https://github.com/Imbernoulli/SURGE.
