NVLM: Open Frontier-Class Multimodal LLMs

Abstract

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.
