CodeMonkeys: Scaling Test-Time Compute for Software Engineering

Abstract

Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. To select among candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.
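The sampling and selection steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released CodeMonkeys implementation: the names `sample_candidates`, `vote_with_tests`, `select_edit`, `run_trajectory`, and `final_selector` are hypothetical, an edit is modeled as a plain string, and each model-generated test is modeled as a function that returns True when an edit passes. The number of trajectories stands in for "parallel" compute and the per-trajectory iteration cap for "serial" compute.

```python
from typing import Callable, List, Sequence

Edit = str                     # assumption: a candidate patch, e.g. a unified diff
Test = Callable[[Edit], bool]  # assumption: True if the edit passes this test


def sample_candidates(
    run_trajectory: Callable[[int, int], Edit],
    num_trajectories: int,   # "parallel" compute: trajectories per problem
    max_iterations: int,     # "serial" compute: iterations per trajectory
) -> List[Edit]:
    """Run independent multi-turn trajectories and collect their final edits."""
    return [run_trajectory(seed, max_iterations) for seed in range(num_trajectories)]


def vote_with_tests(
    candidates: Sequence[Edit], tests: Sequence[Test], top_k: int
) -> List[Edit]:
    """Rank candidates by the number of tests they pass and keep the top_k."""
    scores = [sum(1 for test in tests if test(edit)) for edit in candidates]
    # Sort by descending score; ties keep the original sampling order.
    ranked = sorted(range(len(candidates)), key=lambda i: (-scores[i], i))
    return [candidates[i] for i in ranked[:top_k]]


def select_edit(
    candidates: Sequence[Edit],
    tests: Sequence[Test],
    final_selector: Callable[[List[Edit]], Edit],
    top_k: int = 3,
) -> Edit:
    """Shortlist by test voting, then defer to a final selection step.

    final_selector is a stand-in for the dedicated multi-turn selection
    trajectory mentioned in the abstract; its interface is assumed here.
    """
    shortlist = vote_with_tests(candidates, tests, top_k)
    return final_selector(shortlist)
```

In this sketch the up-front cost of gathering codebase context would be paid once per issue, outside `run_trajectory`, and reused by every sampled trajectory, which is the amortization the abstract refers to.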
