
NVLM: Open Frontier-Class Multimodal LLMs

September 17, 2024
Authors: Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.
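To make the 1-D tile-tagging idea mentioned above concrete, here is a minimal sketch, not the authors' implementation: it assumes a generic `tokenizer.encode` helper and illustrative tag strings (`<tile_k>`, `<tile_global_thumbnail>`), and it simply interleaves a text tag before each flattened tile's token sequence so the LLM can distinguish tiles of a dynamically tiled high-resolution image within a single 1-D stream. The ordering of the thumbnail relative to the regular tiles is also an assumption here.

```python
# Minimal sketch (not the NVLM code): 1-D tile tagging for a tiled high-res image.
# Tag strings, helper names, and tile ordering are illustrative assumptions.
from typing import List


def tag_tiles(tile_token_seqs: List[List[int]],
              thumbnail_tokens: List[int],
              tokenizer) -> List[int]:
    """Build a flat 1-D token sequence: a global thumbnail followed by each
    regular tile, with a text tag inserted before every tile's tokens."""
    sequence: List[int] = []
    # Global thumbnail first, marked by its own tag (ordering is an assumption).
    sequence += tokenizer.encode("<tile_global_thumbnail>")
    sequence += thumbnail_tokens
    # Regular tiles in scan order, each preceded by a 1-D positional tag.
    for i, tokens in enumerate(tile_token_seqs, start=1):
        sequence += tokenizer.encode(f"<tile_{i}>")
        sequence += tokens
    return sequence
```

Per the abstract, explicitly marking tile positions in the flattened sequence is what yields the reported gains on multimodal reasoning and OCR-related tasks, since the model can otherwise not tell where one tile ends and the next begins.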

