

CodeMonkeys: Scaling Test-Time Compute for Software Engineering

January 24, 2025
作者: Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini
cs.AI

Abstract

Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.
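The abstract's selection step, voting over candidate edits with model-generated tests, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_candidate` and the representation of tests as pass/fail callables are assumptions for illustration; per the abstract, CodeMonkeys additionally runs a final multi-turn trajectory dedicated to selection rather than relying on votes alone.

```python
# Hedged sketch of test-based voting over candidate edits:
# each candidate earns one vote per model-generated test it passes,
# and the top-voted candidate is selected (ties broken by order).
from typing import Callable, List


def select_candidate(
    candidates: List[str],
    tests: List[Callable[[str], bool]],  # hypothetical: test(candidate) -> passed?
) -> str:
    # Count how many tests each candidate edit passes.
    votes = {cand: sum(test(cand) for test in tests) for cand in candidates}
    # max() keeps the first-seen candidate among equals, a simple tie-break.
    return max(candidates, key=lambda c: votes[c])


# Toy usage: three candidate "edits", two tests checking for substrings.
cands = ["fix_a", "fix_b", "fix_ab"]
checks = [lambda c: "a" in c, lambda c: "b" in c]
print(select_candidate(cands, checks))  # "fix_ab" passes both tests
```

The same voting interface also supports the ensembling result described above: candidates from different SWE-bench submissions can simply be pooled into `candidates` before voting.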

