OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning

December 22, 2024
Authors: Yuxiang Zhang, Yuqi Yang, Jiangming Shu, Yuhang Wang, Jinlin Xiao, Jitao Sang
cs.AI

Abstract

OpenAI's recent introduction of Reinforcement Fine-Tuning (RFT) showcases the potential of reasoning foundation models and offers a new paradigm for fine-tuning that goes beyond simple pattern imitation. This technical report presents OpenRFT, our attempt to fine-tune generalist reasoning models for domain-specific tasks under the same settings as RFT. OpenRFT addresses two key challenges, the lack of reasoning-step data and the limited number of training samples, by leveraging domain-specific samples in three ways: question augmentation, synthesizing reasoning-process data, and few-shot ICL. The evaluation is conducted on SciKnowEval, where OpenRFT achieves notable performance gains with only 100 domain-specific samples per task. More experimental results will be updated continuously in later versions. Source code, datasets, and models are available at: https://github.com/ADaM-BJTU/OpenRFT

Summary

AI-Generated Summary

Paper Overview

The paper introduces OpenRFT, a method that fine-tunes generalist reasoning models for domain-specific tasks using Reinforcement Fine-Tuning (RFT). It addresses the scarcity of reasoning-step data and training samples by leveraging domain-specific samples through question augmentation, reasoning-process data synthesis, and few-shot In-Context Learning (ICL). Experiments on SciKnowEval show notable performance gains with only 100 domain-specific samples per task.

Core Contribution

  • Introduction of OpenRFT for fine-tuning generalist reasoning models for domain-specific tasks.
  • Utilization of Reinforcement Fine-Tuning (RFT) to address data scarcity challenges.
  • Integration of domain-specific samples through question augmentation, reasoning-process data synthesis, and few-shot In-Context Learning (ICL).

Research Context

  • Positioning OpenRFT within the domain of fine-tuning generalist models for specific tasks.
  • Addressing the limitations of data scarcity and domain-specific performance enhancement.
  • Combining Reinforcement Learning (RL) with domain-specific data augmentation for improved model performance.

Keywords

Reinforcement Fine-Tuning (RFT), Domain-specific tasks, Question Augmentation, Few-shot In-Context Learning (ICL), SciKnowEval dataset

Background

The research is motivated by the need to adapt generalist reasoning models to domain-specific tasks when domain data is scarce. OpenRFT aims to bridge this gap by leveraging a small set of domain-specific samples through question augmentation, reasoning-process data synthesis, and few-shot In-Context Learning (ICL).

Research Gap

  • Limited availability of domain-specific data for fine-tuning generalist models.
  • Challenges in achieving high performance on domain-specific tasks with limited training samples.

Technical Challenges

  • Data scarcity for domain-specific tasks.
  • Enhancing model performance with minimal domain-specific samples.

Prior Approaches

  • Existing solutions lack efficient methods to fine-tune generalist models for domain-specific tasks.
  • Limited techniques for synthesizing reasoning-process data and leveraging domain-specific samples effectively.

Methodology

The OpenRFT methodology combines data augmentation, SFT-based imitation, and RL-based exploration to fine-tune a generalist reasoning model for a domain-specific task, as sketched below.
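
The following is a minimal, hypothetical sketch of how these three stages could be chained; none of the function or parameter names come from the paper's released code.

```python
# Hypothetical three-stage flow: augment data -> imitate (SFT) -> explore (RL).
# The stage functions are injected as callables; everything here is illustrative.
from typing import Any, Callable, List

def openrft_pipeline(
    samples: List[dict],
    augment: Callable[[List[dict]], List[dict]],   # question rewriting, option shuffling
    imitate: Callable[[List[dict]], Any],          # SFT on synthesized reasoning traces
    explore: Callable[[Any, List[dict]], Any],     # RL with PRM and outcome rewards
) -> Any:
    augmented = augment(samples)         # expand the ~100 domain-specific samples
    policy = imitate(augmented)          # warm-start the policy by imitation
    policy = explore(policy, augmented)  # refine the policy via reinforcement learning
    return policy
```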

Theoretical Foundation

  • Utilization of Reinforcement Learning (RL) for fine-tuning generalist models.
  • Integration of domain-specific data augmentation techniques for enhanced model performance.

Technical Architecture

  • Data augmentation through question rewriting and option shuffling (see the sketch after this list).
  • SFT-based imitation for reasoning process synthesis.
  • RL-based exploration with Process Reward Model (PRM) supervision.
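
Below is a minimal sketch of the option-shuffling half of the data augmentation, assuming multiple-choice samples stored as simple dicts; the data format and function name are assumptions for illustration, not the authors' implementation.

```python
import random

def shuffle_options(sample, seed=None):
    """Permute the option texts of a multiple-choice sample and relabel the answer.

    Assumed format: {"question": str, "options": {"A": str, ...}, "answer": "A"}.
    """
    rng = random.Random(seed)
    labels = sorted(sample["options"])                 # e.g. ["A", "B", "C", "D"]
    texts = [sample["options"][label] for label in labels]
    correct_text = sample["options"][sample["answer"]]

    rng.shuffle(texts)                                 # shuffle only the option texts
    new_options = dict(zip(labels, texts))
    new_answer = next(l for l, t in new_options.items() if t == correct_text)

    return {"question": sample["question"], "options": new_options, "answer": new_answer}

# One source sample can yield several augmented variants.
sample = {"question": "Which gas is most abundant in Earth's atmosphere?",
          "options": {"A": "Oxygen", "B": "Nitrogen", "C": "Argon", "D": "Carbon dioxide"},
          "answer": "B"}
variants = [shuffle_options(sample, seed=i) for i in range(3)]
```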

Implementation Details

  • Utilization of Proximal Policy Optimization (PPO) algorithm for RL optimization.
  • Incorporation of outcome-based and process-based rewards in the reward function (see the sketch after this list).
  • Teacher-student policy alignment for effective fine-tuning.
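
As an illustration of how outcome and process rewards might be combined, here is a hedged sketch; the aggregation (a simple weighted sum with a hypothetical weight `alpha`) is an assumption, not the paper's exact reward formulation.

```python
def outcome_reward(predicted: str, gold: str) -> float:
    # 1.0 if the final answer matches the gold label, else 0.0.
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def process_reward(step_scores) -> float:
    # Aggregate per-step scores produced by a Process Reward Model (PRM);
    # averaging is an illustrative choice.
    return sum(step_scores) / len(step_scores) if step_scores else 0.0

def combined_reward(predicted, gold, step_scores, alpha=0.5) -> float:
    # Weighted mix of outcome and process signals; alpha is a hypothetical
    # hyperparameter, not a value reported in the paper.
    return alpha * outcome_reward(predicted, gold) + (1 - alpha) * process_reward(step_scores)

print(round(combined_reward("B", "B", [0.9, 0.7, 0.8]), 3))  # 0.9
```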

Innovation Points

  • Integration of domain-specific samples through question augmentation and reasoning-process data synthesis.
  • Utilization of RL with Process Reward Model (PRM) for effective fine-tuning.
  • Few-shot In-Context Learning (ICL) for domain knowledge embedding (see the sketch after this list).
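
The snippet below is a minimal sketch of how a handful of domain-specific samples could be embedded as in-context examples; the prompt template and sample contents are hypothetical.

```python
def build_fewshot_prompt(examples, query):
    """examples: list of dicts with 'question' and 'answer' keys."""
    parts = [f"Question: {ex['question']}\nAnswer: {ex['answer']}\n" for ex in examples]
    parts.append(f"Question: {query}\nAnswer:")
    return "\n".join(parts)

examples = [
    {"question": "What is the SI unit of force?", "answer": "Newton"},
    {"question": "Which particle carries a negative charge?", "answer": "Electron"},
]
print(build_fewshot_prompt(examples, "What is the SI unit of pressure?"))
```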

Experimental Validation

The experimental validation on the SciKnowEval dataset demonstrates the effectiveness of OpenRFT in enhancing performance on reasoning-level L3 tasks with minimal domain-specific samples.

Setup

  • Evaluation on the SciKnowEval dataset with reasoning tasks at level L3.
  • Use of Skywork-o1 Open series models for policy and process reward models.
  • Training on NVIDIA H20-96GB GPUs using OpenRLHF for reinforcement learning.

Metrics

  • Average performance increase of 11% with only 100 domain-specific samples.
  • Comparison with baseline methods including Vanilla, SFT, ReFT, and variations of RL-based methods.

Results

  • o1-mini outperforms GPT-4o-mini in reasoning tasks.
  • SFT+RL(PRM)+DA achieves the best performance among the compared methods.

Comparative Analysis

  • Comparison with Vanilla, SFT, ReFT, and other variations of RL-based methods.
  • Demonstrates the effectiveness of OpenRFT in enhancing domain-specific performance.

Impact and Implications

The study highlights the significance of fine-tuning generalist reasoning models for domain-specific tasks and the potential implications for improving model performance with limited domain-specific samples.

Key Findings

  • OpenRFT significantly enhances performance on domain-specific tasks with minimal samples.
  • Teacher-student policy alignment is crucial for effective fine-tuning.
  • Integration of RL with domain-specific data augmentation improves model reasoning capabilities.

Limitations

  • Challenges related to inconsistent tasks and action modes in Reinforcement Fine-Tuning (RFT).
  • Dependency on domain-specific data for performance enhancement.

Future Directions

  • Enhancing domain knowledge embedding and data augmentation techniques.
  • Improving the general reasoning capabilities of foundation models for diverse tasks.

Practical Significance

  • Application of OpenRFT for fine-tuning generalist models in various domains.
  • Potential for real-world applications requiring domain-specific reasoning capabilities.
