
Visual-RFT: Visual Reinforcement Fine-Tuning

March 3, 2025
Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable rewards is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT to visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, and open-vocabulary object detection benchmarks show competitive performance and stronger generalization ability of Visual-RFT compared with Supervised Fine-Tuning (SFT). For example, Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by 21.9 points on COCO's two-shot setting and by 15.4 points on LVIS. Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
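The procedure described in the abstract rests on two components: a task-specific verifiable reward (e.g., IoU against the ground-truth box for detection) and a group-relative policy update computed over multiple sampled responses per input. The following is a minimal Python sketch of those two pieces; it is not the authors' released code, and the function names, the reasoning-format bonus, and its 0.1 weighting are illustrative assumptions.

```python
# Minimal sketch (not the authors' released implementation) of an IoU-based
# verifiable reward for object detection and a GRPO-style group-relative
# advantage over multiple sampled responses for the same input.
from typing import List


def iou(box_a: List[float], box_b: List[float]) -> float:
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_reward(pred_box: List[float], gt_box: List[float],
                     has_reasoning: bool) -> float:
    """Verifiable reward: IoU with the ground-truth box, plus a small
    format bonus when the response contains reasoning tokens
    (the 0.1 weighting is an assumption for illustration)."""
    return iou(pred_box, gt_box) + (0.1 if has_reasoning else 0.0)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: each sampled response's reward is normalized
    by the mean and standard deviation of its group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]


# Example: four sampled responses to one image/query, each with a predicted box.
gt = [50.0, 40.0, 200.0, 180.0]
preds = [
    [48.0, 42.0, 198.0, 175.0],  # close to ground truth
    [60.0, 55.0, 210.0, 190.0],  # decent overlap
    [0.0, 0.0, 30.0, 30.0],      # misses the object
    [55.0, 45.0, 190.0, 170.0],
]
rewards = [detection_reward(p, gt, has_reasoning=True) for p in preds]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)  # higher-IoU responses receive positive advantages
```

In this sketch the reward is checked directly against the annotation, which is what makes it "verifiable": no learned reward model is involved, and each response is ranked only relative to the other responses sampled for the same input.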
