InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
April 14, 2025
Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
cs.AI
Abstract
We introduce InternVL3, a significant advancement in the InternVL series
featuring a native multimodal pre-training paradigm. Rather than adapting a
text-only large language model (LLM) into a multimodal large language model
(MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and
linguistic capabilities from both diverse multimodal data and pure-text corpora
during a single pre-training stage. This unified training paradigm effectively
addresses the complexities and alignment challenges commonly encountered in
conventional post-hoc training pipelines for MLLMs. To further improve
performance and scalability, InternVL3 incorporates variable visual position
encoding (V2PE) to support extended multimodal contexts, employs advanced
post-training techniques such as supervised fine-tuning (SFT) and mixed
preference optimization (MPO), and adopts test-time scaling strategies
alongside an optimized training infrastructure. Extensive empirical evaluations
demonstrate that InternVL3 delivers superior performance across a wide range of
multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the
MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its
capabilities remain highly competitive with leading proprietary models,
including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also
maintaining strong pure-language proficiency. In pursuit of open-science
principles, we will publicly release both the training data and model weights
to foster further research and development in next-generation MLLMs.
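For readers skimming past the acronyms: the idea behind V2PE is that image tokens consume position indices far faster than text does, so visual tokens are given a fractional position stride while text tokens keep a stride of 1, letting long multimodal contexts fit inside the model's position window. Below is a minimal sketch of that indexing rule; the function name, mask convention, and the stride value 0.25 are illustrative assumptions, not the paper's implementation.

```python
import torch

def v2pe_position_ids(is_visual: torch.Tensor, delta: float = 0.25) -> torch.Tensor:
    """Position indices where each text token advances the position by 1
    and each visual token by the smaller stride `delta` (V2PE-style).
    `delta` = 0.25 is an illustrative choice, not the paper's setting."""
    strides = torch.ones(is_visual.shape, dtype=torch.float32)
    strides[is_visual] = delta          # compress runs of image tokens
    # A token's position is the cumulative stride of everything before it.
    return torch.cumsum(strides, dim=-1) - strides

# Example: two text tokens, four image tokens, one trailing text token.
mask = torch.tensor([False, False, True, True, True, True, False])
print(v2pe_position_ids(mask))
# tensor([0.0000, 1.0000, 2.0000, 2.2500, 2.5000, 2.7500, 3.0000])
```

With the 0.25 stride above, four image tokens span a single position unit, which is how the scheme stretches the effective multimodal context without enlarging the position window.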
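Mixed preference optimization, per the MPO line of work, combines three objectives: a DPO-style pairwise preference loss, a BCO-style per-response quality loss, and a standard SFT generation loss. The following is a hedged sketch of such a combined objective; the weights, `beta`, and the batch-mean baseline are illustrative choices rather than InternVL3's reported configuration.

```python
import torch
import torch.nn.functional as F

def mpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, sft_nll,
             beta: float = 0.1, weights=(0.8, 0.2, 1.0)):
    """Weighted sum of preference (DPO-style), quality (BCO-style), and
    generation (SFT) objectives over a batch of preference pairs.
    Log-prob inputs are per-sequence sums of token log-probabilities."""
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    # Preference term: maximize the margin between chosen and rejected.
    l_pref = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    # Quality term: judge each response on its own against a reward
    # baseline; the detached batch mean stands in for a running baseline.
    baseline = torch.cat([chosen_reward, rejected_reward]).mean().detach()
    l_qual = (-F.logsigmoid(chosen_reward - baseline)
              - F.logsigmoid(baseline - rejected_reward)).mean()
    # Generation term: ordinary negative log-likelihood on chosen responses.
    l_gen = sft_nll.mean()
    w_p, w_q, w_g = weights
    return w_p * l_pref + w_q * l_qual + w_g * l_gen
```

The intuition, as described in the MPO paper, is that the absolute quality signal and the plain generation loss regularize the pairwise preference term, which on its own can degrade generation quality.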