Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
April 8, 2025
Authors: Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI
Abstract
We introduce Skywork R1V, a multimodal reasoning model extending an
R1-series large language model (LLM) to the visual modality via an efficient
multimodal transfer method. Leveraging a lightweight visual projector, Skywork
R1V facilitates seamless multimodal adaptation without necessitating retraining
of either the foundational language model or the vision encoder. To strengthen
visual-text alignment, we propose a hybrid optimization strategy that combines
Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization
(GRPO), significantly enhancing cross-modal integration efficiency.
Additionally, we introduce an adaptive-length Chain-of-Thought distillation
approach for reasoning data generation. This approach dynamically optimizes
reasoning chain lengths, thereby enhancing inference efficiency and preventing
overthinking from excessively long reasoning chains. Empirical evaluations demonstrate that
Skywork R1V, with only 38B parameters, delivers competitive performance,
achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista.
Meanwhile, it maintains robust textual reasoning performance, evidenced by
impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model
weights have been publicly released to promote openness and reproducibility.
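The lightweight visual projector described above can be pictured as a small MLP that maps frozen vision-encoder features into the frozen LLM's embedding space, so that only the projector needs training. The sketch below is a minimal dependency-free illustration; the dimensions, two-layer design, and ReLU activation are assumptions for exposition, not the paper's exact configuration.

```python
# Minimal sketch of a lightweight visual projector (illustrative, not the
# authors' implementation): a two-layer MLP mapping a vision feature vector
# into the LLM's token-embedding space. Both the vision encoder and the LLM
# stay frozen; only weights like w1 and w2 would be trained.

def matmul(x, w):
    """x: [d_in], w: d_in x d_out matrix -> [d_out]."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def project(visual_feat, w1, w2):
    """Map one visual feature vector to one LLM-space embedding."""
    hidden = [max(0.0, v) for v in matmul(visual_feat, w1)]  # ReLU
    return matmul(hidden, w2)  # vector in the LLM's embedding space

# Toy dimensions: vision features of size 4 -> hidden 3 -> LLM dim 2.
w1 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(4)]
w2 = [[1.0 if i == j else 0.0 for j in range(2)] for i in range(3)]
emb = project([1.0, 2.0, 3.0, 4.0], w1, w2)  # -> [1.0, 2.0]
```

Because the projector is the only trainable component, multimodal adaptation avoids the cost of retraining either backbone.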
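The GRPO component of the hybrid optimization strategy scores each sampled response relative to the other responses in its group, rather than against a learned value baseline. A minimal sketch of that group-relative advantage computation, following the standard GRPO formulation (the helper name and toy rewards are illustrative):

```python
# Sketch of GRPO's group-relative advantage (standard formulation, not the
# authors' code): for each prompt, a group of responses is sampled and each
# response's advantage is its reward standardized within the group.

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: scalar rewards for one group of sampled responses."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against a zero std when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group: two correct (reward 1.0) and two incorrect (reward 0.0) responses.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Above-average responses receive positive advantage, below-average negative.
```

These advantages then weight the policy-gradient update, so the model is pushed toward responses that outperform their own group's average.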