InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
April 14, 2025
Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
cs.AI
Abstract
We introduce InternVL3, a significant advancement in the InternVL series
featuring a native multimodal pre-training paradigm. Rather than adapting a
text-only large language model (LLM) into a multimodal large language model
(MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and
linguistic capabilities from both diverse multimodal data and pure-text corpora
during a single pre-training stage. This unified training paradigm effectively
addresses the complexities and alignment challenges commonly encountered in
conventional post-hoc training pipelines for MLLMs. To further improve
performance and scalability, InternVL3 incorporates variable visual position
encoding (V2PE) to support extended multimodal contexts, employs advanced
post-training techniques such as supervised fine-tuning (SFT) and mixed
preference optimization (MPO), and adopts test-time scaling strategies
alongside an optimized training infrastructure. Extensive empirical evaluations
demonstrate that InternVL3 delivers superior performance across a wide range of
multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the
MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its
capabilities remain highly competitive with leading proprietary models,
including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also
maintaining strong pure-language proficiency. In pursuit of open-science
principles, we will publicly release both the training data and model weights
to foster further research and development in next-generation MLLMs.
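The abstract names variable visual position encoding (V2PE) without detail. As a rough, non-authoritative sketch of the idea: text tokens advance the position index by 1, while visual tokens advance it by a smaller fractional step, so image-heavy contexts consume less of the positional budget. The function name and the default delta below are illustrative assumptions, not InternVL3's actual implementation.

```python
# Minimal sketch of the idea behind Variable Visual Position Encoding (V2PE).
# Assumption: visual tokens advance the position index by a fractional step
# `delta` (< 1) while text tokens advance it by 1, keeping long multimodal
# contexts within a compact positional range. Names are illustrative.

def v2pe_positions(token_types, delta=0.25):
    """token_types: iterable of 'text' / 'image' markers, one per token.
    Returns one (possibly fractional) position index per token."""
    positions, pos = [], 0.0
    for kind in token_types:
        positions.append(pos)
        pos += 1.0 if kind == "text" else delta
    return positions

# Example: 2 text tokens, 8 image-patch tokens, 1 text token.
print(v2pe_positions(["text"] * 2 + ["image"] * 8 + ["text"]))
# [0.0, 1.0, 2.0, 2.25, 2.5, ..., 3.75, 4.0] -- the 8 image patches span
# only two units of positional budget instead of eight.
```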
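Mixed preference optimization (MPO) is likewise only named. In the MPO formulation used in the InternVL line of work, the objective is a weighted combination of a preference loss, a quality loss, and a generation loss; the sketch below assumes that structure with a DPO-style preference term, a BCO-style quality term, and an SFT term. The weights, beta, and tensor shapes are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of a Mixed Preference Optimization (MPO) objective: a weighted
# sum of a DPO-style preference loss, a BCO-style quality loss, and a plain
# generation (SFT) loss. All hyperparameters here are illustrative.

import torch
import torch.nn.functional as F

def mpo_loss(pi_chosen, pi_rejected,    # policy log-probs, shape (B,)
             ref_chosen, ref_rejected,  # frozen reference log-probs, (B,)
             nll_chosen,                # per-sample SFT NLL on chosen, (B,)
             beta=0.1, w_pref=0.8, w_qual=0.1, w_gen=0.1):
    # Preference term (DPO-style): prefer chosen over rejected responses.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    l_pref = -F.logsigmoid(margin).mean()

    # Quality term (BCO-style): judge each response on its own merits; the
    # running reward-shift used to stabilize training is omitted for brevity.
    l_qual = 0.5 * (-F.logsigmoid(beta * (pi_chosen - ref_chosen))
                    - F.logsigmoid(-beta * (pi_rejected - ref_rejected))).mean()

    # Generation term: standard SFT negative log-likelihood on the chosen
    # response, anchoring the policy to fluent outputs.
    l_gen = nll_chosen.mean()

    return w_pref * l_pref + w_qual * l_qual + w_gen * l_gen

# Toy call with random stand-in statistics (batch of 4):
print(mpo_loss(torch.randn(4), torch.randn(4),
               torch.randn(4), torch.randn(4), torch.rand(4)))
```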
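Finally, the test-time scaling strategies mentioned above are commonly instantiated as Best-of-N selection: sample several candidate answers and keep the one a critic or reward model scores highest. The `generate` and `score` callables below are hypothetical stand-ins, not InternVL3's actual API.

```python
# Minimal sketch of Best-of-N test-time scaling: draw N candidate answers
# and return the highest-scoring one under a critic/reward model.

def best_of_n(prompt, generate, score, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

# Toy usage with stand-in functions:
import random
answer = best_of_n("What is 2 + 2?",
                   generate=lambda p: random.choice(["3", "4", "5"]),
                   score=lambda p, a: 1.0 if a == "4" else 0.0)
print(answer)  # "4" whenever it appears among the sampled candidates
```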