Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
April 8, 2025
Authors: Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI
Abstract
We introduce Skywork R1V, a multimodal reasoning model extending the
R1-series large language models (LLMs) to visual modalities via an efficient
multimodal transfer method. Leveraging a lightweight visual projector, Skywork
R1V facilitates seamless multimodal adaptation without necessitating retraining
of either the foundational language model or the vision encoder. To strengthen
visual-text alignment, we propose a hybrid optimization strategy that combines
Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization
(GRPO), significantly enhancing cross-modal integration efficiency.
Additionally, we introduce an adaptive-length Chain-of-Thought distillation
approach for reasoning data generation. This approach dynamically optimizes
reasoning chain lengths, thereby enhancing inference efficiency and preventing
the overthinking that arises from excessive reasoning. Empirical evaluations demonstrate that
Skywork R1V, with only 38B parameters, delivers competitive performance,
achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista.
Meanwhile, it maintains robust textual reasoning performance, evidenced by
impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model
weights have been publicly released to promote openness and reproducibility.
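The adaptive-length Chain-of-Thought distillation described above can be pictured as a selection step over candidate reasoning chains. The following is a minimal sketch of one plausible reading, not the paper's exact algorithm: among candidate chains that reach the correct answer, keep the shortest, so distillation data discourages overthinking. All names and the length heuristic here are illustrative assumptions.

```python
def select_chain(candidates, gold_answer):
    """Pick a reasoning chain for distillation data (illustrative sketch).

    candidates: list of (chain_text, predicted_answer) tuples, e.g. multiple
    samples from a teacher model at varying reasoning lengths.
    Returns the shortest chain whose answer matches gold_answer, or None
    if no candidate solves the problem (such examples would be dropped).
    """
    correct = [chain for chain, ans in candidates if ans == gold_answer]
    if not correct:
        return None
    # Shortest correct chain: enough reasoning to be right, no more.
    return min(correct, key=len)


candidates = [
    ("Step 1: ... Step 9: so the answer is 42", "42"),
    ("6 * 7 = 42", "42"),
    ("A long detour ... the answer is 41", "41"),
]
print(select_chain(candidates, "42"))  # prints "6 * 7 = 42"
```

In practice a real pipeline would balance such a length penalty against answer quality and chain diversity; this sketch only illustrates the "dynamically optimize chain length" idea from the abstract.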