
SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning

February 27, 2025
Authors: Zexiong Ma, Chao Peng, Pengfei Gao, Xiangxin Meng, Yanzhen Zou, Bing Xie
cs.AI

Abstract

Mainstream issue-resolving frameworks predominantly rely on commercial models, leading to high costs and privacy concerns. Existing training approaches for issue resolving struggle with poor generalization and fail to fully leverage open-source development resources. We propose Subtask-oriented Reinforced Fine-Tuning (SoRFT), a novel training approach that enhances the issue-resolving capability of LLMs. SoRFT decomposes issue resolving into structured subtasks: file localization, function localization, line localization, and code edit generation. SoRFT consists of two training stages: (1) rejection-sampled supervised fine-tuning, in which Chain of Thought (CoT) data is filtered using ground truth before fine-tuning the LLM, and (2) rule-based reinforcement learning, which applies Proximal Policy Optimization (PPO) with ground-truth-based rewards. We evaluate the SoRFT-trained models on SWE-Bench Verified and SWE-Bench Lite, achieving state-of-the-art (SOTA) performance among open-source models (e.g., SoRFT-Qwen-7B resolves 21.4% of issues on SWE-Bench Verified). The experimental results demonstrate that SoRFT significantly enhances issue-resolving performance, improves model generalization, and provides a cost-efficient alternative to commercial models.
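
As a rough illustration of the ground-truth-based signals described in the abstract, the sketch below shows how a rule-based reward for the file-localization subtask and a rejection-sampling filter for CoT data might be computed. The function names, the F1-style scoring, and the threshold are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of SoRFT-style ground-truth signals (not the authors' code).
# Assumes the ground truth is the set of files touched by the reference patch.

def file_localization_reward(predicted_files: list[str], gold_files: set[str]) -> float:
    """Rule-based reward: F1 overlap between predicted and ground-truth files (assumed metric)."""
    predicted = set(predicted_files)
    if not predicted or not gold_files:
        return 0.0
    tp = len(predicted & gold_files)
    precision = tp / len(predicted)
    recall = tp / len(gold_files)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rejection_sample_cot(samples: list[dict], gold_files: set[str], threshold: float = 1.0) -> list[dict]:
    """Stage 1 (rejection-sampled SFT): keep only CoT samples whose final answer
    agrees with the ground truth well enough to be used as fine-tuning data."""
    kept = []
    for sample in samples:
        # `sample["answer"]` is assumed to be the file list parsed from the model's CoT output.
        if file_localization_reward(sample["answer"], gold_files) >= threshold:
            kept.append(sample)
    return kept
```

The same pattern would presumably extend to the function- and line-localization subtasks, with ground truth derived from the reference patch; in stage 2, such rule-based rewards would be fed into PPO.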

