Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
April 3, 2025
Authors: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang
cs.AI
Abstract
The task of issue resolving is to modify a codebase to generate a patch that
addresses a given issue. However, existing benchmarks, such as SWE-bench, focus
almost exclusively on Python, making them insufficient for evaluating Large
Language Models (LLMs) across diverse software ecosystems. To address this, we
introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench,
covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a
total of 1,632 high-quality instances, which were carefully annotated from
2,456 candidates by 68 expert annotators, ensuring that the benchmark can
provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we
evaluate a series of state-of-the-art models using three representative methods
(Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with
key empirical insights. In addition, we launch a Multi-SWE-RL open-source
community, aimed at building large-scale reinforcement learning (RL) training
datasets for issue-resolving tasks. As an initial contribution, we release a
set of 4,723 well-structured instances spanning seven programming languages,
laying a solid foundation for RL research in this domain. More importantly, we
open-source our entire data production pipeline, along with detailed tutorials,
encouraging the open-source community to continuously contribute and expand the
dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL
community as catalysts for advancing RL toward its full potential, bringing us
one step closer to the dawn of AGI.
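To make the notion of an "instance" concrete, the sketch below models the kind of record an issue-resolving benchmark typically contains (repository, issue text, gold patch, and the tests that must flip from failing to passing), along with the usual resolution criterion. The field names and helper here are illustrative assumptions, not Multi-SWE-bench's actual schema; consult the released dataset and pipeline for the real format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an issue-resolving benchmark instance.
# Field names are assumptions for illustration, not the real Multi-SWE-bench schema.
@dataclass
class IssueInstance:
    repo: str                  # e.g. "owner/project"
    language: str              # one of the seven covered languages
    issue_title: str
    issue_body: str
    gold_patch: str            # unified diff that resolves the issue
    fail_to_pass: list = field(default_factory=list)  # tests the patch must fix
    pass_to_pass: list = field(default_factory=list)  # tests that must keep passing

def is_resolved(failing_after: set, fail_to_pass: list, pass_to_pass_ok: bool) -> bool:
    """A candidate patch resolves the instance if every fail-to-pass test
    now passes and no previously passing test has regressed."""
    return pass_to_pass_ok and not (set(fail_to_pass) & failing_after)

example = IssueInstance(
    repo="example/rust-lib",
    language="Rust",
    issue_title="panic on empty input",
    issue_body='calling parse("") panics instead of returning Err',
    gold_patch="--- a/src/parse.rs\n+++ b/src/parse.rs\n...",
    fail_to_pass=["tests::parse_empty_input"],
    pass_to_pass=["tests::parse_basic"],
)

# After applying a patch and rerunning the suite, no tests fail:
print(is_resolved(failing_after=set(),
                  fail_to_pass=example.fail_to_pass,
                  pass_to_pass_ok=True))  # → True
```

Under this criterion, a patch that fixes the reported bug but breaks an unrelated test (`pass_to_pass_ok=False`) still counts as unresolved, which is what keeps benchmarks like this resistant to degenerate "delete the failing test" patches.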