Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
April 3, 2025
Authors: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang
cs.AI
Abstract
The task of issue resolving is to modify a codebase to generate a patch that
addresses a given issue. However, existing benchmarks, such as SWE-bench, focus
almost exclusively on Python, making them insufficient for evaluating Large
Language Models (LLMs) across diverse software ecosystems. To address this, we
introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench,
covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a
total of 1,632 high-quality instances, which were carefully annotated from
2,456 candidates by 68 expert annotators, ensuring that the benchmark can
provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we
evaluate a series of state-of-the-art models using three representative methods
(Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with
key empirical insights. In addition, we launch a Multi-SWE-RL open-source
community, aimed at building large-scale reinforcement learning (RL) training
datasets for issue-resolving tasks. As an initial contribution, we release a
set of 4,723 well-structured instances spanning seven programming languages,
laying a solid foundation for RL research in this domain. More importantly, we
open-source our entire data production pipeline, along with detailed tutorials,
encouraging the open-source community to continuously contribute and expand the
dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL
community as catalysts for advancing RL toward its full potential, bringing us
one step closer to the dawn of AGI.
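For readers unfamiliar with the task setup, the sketch below illustrates what an issue-resolving instance looks like and how a candidate patch is typically judged: check out the buggy commit, apply the model-generated patch, and run the instance's tests. The `Instance` fields, the `evaluate` helper, and the apply-then-test flow are illustrative assumptions for this summary, not the official Multi-SWE-bench schema or evaluation harness.

```python
# Minimal sketch of an issue-resolving instance and its evaluation loop.
# Field names and the evaluation flow are assumptions for illustration only;
# the real Multi-SWE-bench harness runs each instance in a containerized
# per-language environment.
import subprocess
from dataclasses import dataclass


@dataclass
class Instance:
    repo: str          # repository identifier, e.g. "owner/name"
    base_commit: str   # commit the issue was reported against
    issue: str         # natural-language issue text given to the model
    test_cmd: str      # command whose exit code decides pass/fail, e.g. "go test ./..."


def evaluate(instance: Instance, patch: str, workdir: str) -> bool:
    """Apply a model-generated patch to a local checkout and run the tests."""
    # Reset the working copy to the issue's base commit.
    subprocess.run(["git", "checkout", instance.base_commit], cwd=workdir, check=True)
    # Apply the patch from stdin; a patch that does not apply counts as unresolved.
    applied = subprocess.run(["git", "apply", "-"], input=patch, text=True, cwd=workdir)
    if applied.returncode != 0:
        return False
    # The issue is considered resolved iff the instance's tests pass after patching.
    tests = subprocess.run(instance.test_cmd, shell=True, cwd=workdir)
    return tests.returncode == 0
```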