Training Software Engineering Agents and Verifiers with SWE-Gym

December 30, 2024
Authors: Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang
cs.AI

Abstract

We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language-model-based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.
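The abstract rests on two mechanisms: a task counts as resolved when the agent's patch makes the task's unit tests pass, and inference-time scaling samples several agent trajectories and lets a trained verifier choose which one to submit. The following is a minimal sketch of both ideas under stated assumptions. The classes `TaskInstance` and `Trajectory`, the field `required_tests`, and the fixed `toy_scores` verifier are all hypothetical; they are not SWE-Gym's actual data schema or the paper's verifier model. Test outcomes are assumed to be precomputed, whereas in practice they would come from running the unit tests inside each task's executable runtime environment.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical, simplified stand-in for one task instance (not SWE-Gym's real schema)."""
    instance_id: str
    problem_statement: str            # the task, specified in natural language
    required_tests: frozenset[str]    # unit tests a correct fix must make pass (assumed field)


@dataclass(frozen=True)
class Trajectory:
    """One sampled agent rollout, reduced to its final patch and test outcome."""
    patch: str
    passing_tests: frozenset[str]     # assumed precomputed by running tests in the task's runtime


def is_resolved(task: TaskInstance, traj: Trajectory) -> bool:
    # Resolved only if every required test passes after applying the patch.
    return task.required_tests <= traj.passing_tests


def best_of_n(candidates: Sequence[Trajectory],
              verifier: Callable[[Trajectory], float]) -> Trajectory:
    # Inference-time scaling: keep the candidate the verifier scores highest.
    return max(candidates, key=verifier)


def resolve_rate(tasks: Sequence[TaskInstance],
                 samples: Mapping[str, Sequence[Trajectory]],
                 verifier: Callable[[Trajectory], float]) -> float:
    # Fraction of tasks whose verifier-selected patch resolves the task.
    resolved = sum(is_resolved(t, best_of_n(samples[t.instance_id], verifier)) for t in tasks)
    return resolved / len(tasks)


# Toy data (hypothetical): two tasks with two sampled trajectories each.
tasks = [
    TaskInstance("demo-1", "Fix the off-by-one error in the pager.", frozenset({"test_a"})),
    TaskInstance("demo-2", "Handle empty input in the parser.", frozenset({"test_b", "test_c"})),
]
samples = {
    "demo-1": [Trajectory("patch-1a", frozenset()),
               Trajectory("patch-1b", frozenset({"test_a"}))],
    "demo-2": [Trajectory("patch-2a", frozenset({"test_b"})),
               Trajectory("patch-2b", frozenset({"test_b", "test_c"}))],
}
# Stand-in for a trained verifier: fixed scores instead of a model's estimate of success.
# A real verifier must not see the hidden test results; this one only looks at the patch name.
toy_scores = {"patch-1a": 0.2, "patch-1b": 0.9, "patch-2a": 0.7, "patch-2b": 0.4}


def toy_verifier(traj: Trajectory) -> float:
    return toy_scores[traj.patch]


print(resolve_rate(tasks, samples, toy_verifier))  # 0.5
```

With these toy scores the verifier selects the correct patch for `demo-1` but prefers the failing `patch-2a` for `demo-2`, so the resolve rate is 0.5 even though a correct candidate exists for both tasks. The gap between the selected candidate and the best available one is exactly what verifier quality governs.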

December 31, 2024