SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

February 25, 2025
Authors: Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang
cs.AI

Abstract

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
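
The rule-based reward mentioned above (a similarity score between the ground-truth and the LLM-generated solution) can be illustrated with a minimal sketch. The abstract does not specify the metric, so the choices here are assumptions: the function name `similarity_reward` is hypothetical, Python's standard `difflib.SequenceMatcher` stands in for the similarity measure, and the fixed penalty of -1.0 for a missing or unparseable patch is an illustrative choice.

```python
import difflib
from typing import Optional


def similarity_reward(predicted_patch: Optional[str], reference_patch: str) -> float:
    """Hypothetical rule-based reward for a generated patch.

    Assumptions (not taken from the abstract):
    - a prediction that could not be parsed is passed in as None
      and receives a fixed penalty of -1.0;
    - otherwise the reward is the difflib similarity ratio in [0, 1]
      between the predicted and the reference patch text.
    """
    if predicted_patch is None:
        return -1.0
    matcher = difflib.SequenceMatcher(
        None, predicted_patch, reference_patch, autojunk=False
    )
    return matcher.ratio()


if __name__ == "__main__":
    reference = "-    return a - b\n+    return a + b\n"
    # Identical patches get the maximum reward.
    print(similarity_reward(reference, reference))  # 1.0
    # A malformed (missing) prediction gets the penalty.
    print(similarity_reward(None, reference))  # -1.0
```

A reward of this kind needs no test execution or learned reward model, which is what makes it lightweight: it only compares text against the developer's recorded change.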
