InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
April 14, 2025
Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
cs.AI
Abstract
We introduce InternVL3, a significant advancement in the InternVL series
featuring a native multimodal pre-training paradigm. Rather than adapting a
text-only large language model (LLM) into a multimodal large language model
(MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and
linguistic capabilities from both diverse multimodal data and pure-text corpora
during a single pre-training stage. This unified training paradigm effectively
addresses the complexities and alignment challenges commonly encountered in
conventional post-hoc training pipelines for MLLMs. To further improve
performance and scalability, InternVL3 incorporates variable visual position
encoding (V2PE) to support extended multimodal contexts, employs advanced
post-training techniques such as supervised fine-tuning (SFT) and mixed
preference optimization (MPO), and adopts test-time scaling strategies
alongside an optimized training infrastructure. Extensive empirical evaluations
demonstrate that InternVL3 delivers superior performance across a wide range of
multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the
MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its
capabilities remain highly competitive with leading proprietary models,
including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also
maintaining strong pure-language proficiency. In pursuit of open-science
principles, we will publicly release both the training data and model weights
to foster further research and development in next-generation MLLMs.
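For readers skimming past the acronyms: the idea behind V2PE is that image tokens consume position indices far faster than text does, so visual tokens are given a fractional position stride while text tokens keep a stride of 1, letting long multimodal contexts fit inside the model's position window. Below is a minimal sketch of that indexing rule; the function name, mask convention, and the stride value 0.25 are illustrative assumptions, not the paper's implementation.

```python
import torch

def v2pe_position_ids(is_visual: torch.Tensor, delta: float = 0.25) -> torch.Tensor:
    """Position indices where each text token advances the position by 1
    and each visual token by the smaller stride `delta` (V2PE-style).
    `delta` = 0.25 is an illustrative choice, not the paper's setting."""
    strides = torch.ones(is_visual.shape, dtype=torch.float32)
    strides[is_visual] = delta          # compress runs of image tokens
    # A token's position is the cumulative stride of everything before it.
    return torch.cumsum(strides, dim=-1) - strides

# Example: two text tokens, four image tokens, one trailing text token.
mask = torch.tensor([False, False, True, True, True, True, False])
print(v2pe_position_ids(mask))
# tensor([0.0000, 1.0000, 2.0000, 2.2500, 2.5000, 2.7500, 3.0000])
```

With the 0.25 stride above, four image tokens span a single position unit, which is how the scheme stretches the effective multimodal context without enlarging the position window.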
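Mixed preference optimization, per the MPO line of work, combines three objectives: a DPO-style pairwise preference loss, a BCO-style per-response quality loss, and a standard SFT generation loss. The following is a hedged sketch of such a combined objective; the weights, `beta`, and the batch-mean baseline are illustrative choices rather than InternVL3's reported configuration.

```python
import torch
import torch.nn.functional as F

def mpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, sft_nll,
             beta: float = 0.1, weights=(0.8, 0.2, 1.0)):
    """Weighted sum of preference (DPO-style), quality (BCO-style), and
    generation (SFT) objectives over a batch of preference pairs.
    Log-prob inputs are per-sequence sums of token log-probabilities."""
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    # Preference term: maximize the margin between chosen and rejected.
    l_pref = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    # Quality term: judge each response on its own against a reward
    # baseline; the detached batch mean stands in for a running baseline.
    baseline = torch.cat([chosen_reward, rejected_reward]).mean().detach()
    l_qual = (-F.logsigmoid(chosen_reward - baseline)
              - F.logsigmoid(baseline - rejected_reward)).mean()
    # Generation term: ordinary negative log-likelihood on chosen responses.
    l_gen = sft_nll.mean()
    w_p, w_q, w_g = weights
    return w_p * l_pref + w_q * l_qual + w_g * l_gen
```

The intuition, as described in the MPO paper, is that the absolute quality signal and the plain generation loss regularize the pairwise preference term, which on its own can degrade generation quality.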