SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks such as code understanding and code generation. An equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, predicting the output and behavior of a program without actually running it. To investigate this capability systematically, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. We also categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. Code and dataset are released at https://github.com/Imbernoulli/SURGE.
