Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
April 3, 2025
Authors: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang
cs.AI
Abstract
The task of issue resolving is to modify a codebase to generate a patch that
addresses a given issue. However, existing benchmarks, such as SWE-bench, focus
almost exclusively on Python, making them insufficient for evaluating Large
Language Models (LLMs) across diverse software ecosystems. To address this, we
introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench,
covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a
total of 1,632 high-quality instances, which were carefully annotated from
2,456 candidates by 68 expert annotators, ensuring that the benchmark can
provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we
evaluate a series of state-of-the-art models using three representative methods
(Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with
key empirical insights. In addition, we launch a Multi-SWE-RL open-source
community, aimed at building large-scale reinforcement learning (RL) training
datasets for issue-resolving tasks. As an initial contribution, we release a
set of 4,723 well-structured instances spanning seven programming languages,
laying a solid foundation for RL research in this domain. More importantly, we
open-source our entire data production pipeline, along with detailed tutorials,
encouraging the open-source community to continuously contribute and expand the
dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL
community as catalysts for advancing RL toward its full potential, bringing us
one step closer to the dawn of AGI.
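For readers unfamiliar with the task setup, the sketch below illustrates what an issue-resolving instance looks like and how a candidate patch is typically judged: check out the buggy commit, apply the model-generated patch, and run the instance's tests. The `Instance` fields, the `evaluate` helper, and the apply-then-test flow are illustrative assumptions for this summary, not the official Multi-SWE-bench schema or evaluation harness.

```python
# Minimal sketch of an issue-resolving instance and its evaluation loop.
# Field names and the evaluation flow are assumptions for illustration only;
# the real Multi-SWE-bench harness runs each instance in a containerized
# per-language environment.
import subprocess
from dataclasses import dataclass


@dataclass
class Instance:
    repo: str          # repository identifier, e.g. "owner/name"
    base_commit: str   # commit the issue was reported against
    issue: str         # natural-language issue text given to the model
    test_cmd: str      # command whose exit code decides pass/fail, e.g. "go test ./..."


def evaluate(instance: Instance, patch: str, workdir: str) -> bool:
    """Apply a model-generated patch to a local checkout and run the tests."""
    # Reset the working copy to the issue's base commit.
    subprocess.run(["git", "checkout", instance.base_commit], cwd=workdir, check=True)
    # Apply the patch from stdin; a patch that does not apply counts as unresolved.
    applied = subprocess.run(["git", "apply", "-"], input=patch, text=True, cwd=workdir)
    if applied.returncode != 0:
        return False
    # The issue is considered resolved iff the instance's tests pass after patching.
    tests = subprocess.run(instance.test_cmd, shell=True, cwd=workdir)
    return tests.returncode == 0
```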