Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
April 23, 2025
Authors: Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI
Abstract
We present Skywork R1V2, a next-generation multimodal reasoning model and a
major leap forward from its predecessor, Skywork R1V. At its core, R1V2
introduces a hybrid reinforcement learning paradigm that harmonizes
reward-model guidance with rule-based strategies, thereby addressing the
long-standing challenge of balancing sophisticated reasoning capabilities with
broad generalization. To further enhance training efficiency, we propose the
Selective Sample Buffer (SSB) mechanism, which effectively counters the
"Vanishing Advantages" dilemma inherent in Group Relative Policy Optimization
(GRPO) by prioritizing high-value samples throughout the optimization process.
Notably, we observe that excessive reinforcement signals can induce visual
hallucinations, a phenomenon we systematically monitor and mitigate through
calibrated reward thresholds throughout the training process. Empirical results
affirm the exceptional capability of R1V2, with benchmark-leading performances
such as 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and
74.0 on MMMU. These results underscore R1V2's superiority over existing
open-source models and demonstrate significant progress in closing the
performance gap with premier proprietary systems, including Gemini 2.5 and
OpenAI o4-mini. The Skywork R1V2 model weights have been publicly released to
promote openness and reproducibility:
https://huggingface.co/Skywork/Skywork-R1V2-38B
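
To make the "Vanishing Advantages" issue concrete, the following is a minimal sketch and not the authors' implementation. It assumes the commonly used GRPO-style advantage, in which each reward is standardized within its sampled group, so a group whose responses all receive the same reward yields all-zero advantages and no learning signal. The `SelectiveSampleBuffer` class, its `capacity` and `min_abs_advantage` parameters, and the rule of ranking by absolute advantage are hypothetical stand-ins for "prioritizing high-value samples"; the abstract does not specify the actual selection criterion.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style, assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SelectiveSampleBuffer:
    """Hypothetical buffer that retains samples with non-negligible advantage."""

    def __init__(self, capacity, min_abs_advantage=1e-3):
        self.capacity = capacity
        self.min_abs_advantage = min_abs_advantage
        self.items = []  # list of (sample, advantage) pairs

    def add_group(self, samples, rewards):
        """Score one group and keep only the samples that carry a signal."""
        advantages = group_relative_advantages(rewards)
        for sample, adv in zip(samples, advantages):
            if abs(adv) >= self.min_abs_advantage:
                self.items.append((sample, adv))
        # Assumed priority rule: largest |advantage| first, truncated to capacity.
        self.items.sort(key=lambda item: abs(item[1]), reverse=True)
        del self.items[self.capacity:]
        return advantages


buffer = SelectiveSampleBuffer(capacity=3)

# Every response earns the same reward: all advantages vanish, nothing is stored.
uniform = buffer.add_group(["a", "b", "c", "d"], [1.0, 1.0, 1.0, 1.0])

# Mixed rewards: advantages are non-zero, so the samples become buffer candidates.
mixed = buffer.add_group(["e", "f", "g", "h"], [1.0, 0.0, 0.0, 0.0])
```

In this sketch the uniform group contributes nothing to the buffer, while the mixed group fills it with its highest-magnitude samples. Reusing such retained samples in later updates is one plausible way to keep the gradient signal dense as more groups become uniformly correct or uniformly wrong.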