Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
April 3, 2025
Authors: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang
cs.AI
Abstract
The task of issue resolving is to modify a codebase to generate a patch that
addresses a given issue. However, existing benchmarks, such as SWE-bench, focus
almost exclusively on Python, making them insufficient for evaluating Large
Language Models (LLMs) across diverse software ecosystems. To address this, we
introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench,
covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a
total of 1,632 high-quality instances, which were carefully annotated from
2,456 candidates by 68 expert annotators, ensuring that the benchmark can
provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we
evaluate a series of state-of-the-art models using three representative methods
(Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with
key empirical insights. In addition, we launch a Multi-SWE-RL open-source
community, aimed at building large-scale reinforcement learning (RL) training
datasets for issue-resolving tasks. As an initial contribution, we release a
set of 4,723 well-structured instances spanning seven programming languages,
laying a solid foundation for RL research in this domain. More importantly, we
open-source our entire data production pipeline, along with detailed tutorials,
encouraging the open-source community to continuously contribute and expand the
dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL
community as catalysts for advancing RL toward its full potential, bringing us
one step closer to the dawn of AGI.
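To make the notion of an "instance" concrete, the sketch below models the kind of record an issue-resolving benchmark typically contains (repository, issue text, gold patch, and the tests that must flip from failing to passing), along with the usual resolution criterion. The field names and helper here are illustrative assumptions, not Multi-SWE-bench's actual schema; consult the released dataset and pipeline for the real format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an issue-resolving benchmark instance.
# Field names are assumptions for illustration, not the real Multi-SWE-bench schema.
@dataclass
class IssueInstance:
    repo: str                  # e.g. "owner/project"
    language: str              # one of the seven covered languages
    issue_title: str
    issue_body: str
    gold_patch: str            # unified diff that resolves the issue
    fail_to_pass: list = field(default_factory=list)  # tests the patch must fix
    pass_to_pass: list = field(default_factory=list)  # tests that must keep passing

def is_resolved(failing_after: set, fail_to_pass: list, pass_to_pass_ok: bool) -> bool:
    """A candidate patch resolves the instance if every fail-to-pass test
    now passes and no previously passing test has regressed."""
    return pass_to_pass_ok and not (set(fail_to_pass) & failing_after)

example = IssueInstance(
    repo="example/rust-lib",
    language="Rust",
    issue_title="panic on empty input",
    issue_body='calling parse("") panics instead of returning Err',
    gold_patch="--- a/src/parse.rs\n+++ b/src/parse.rs\n...",
    fail_to_pass=["tests::parse_empty_input"],
    pass_to_pass=["tests::parse_basic"],
)

# After applying a patch and rerunning the suite, no tests fail:
print(is_resolved(failing_after=set(),
                  fail_to_pass=example.fail_to_pass,
                  pass_to_pass_ok=True))  # → True
```

Under this criterion, a patch that fixes the reported bug but breaks an unrelated test (`pass_to_pass_ok=False`) still counts as unresolved, which is what keeps benchmarks like this resistant to degenerate "delete the failing test" patches.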