VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
January 22, 2025
Authors: Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
cs.AI
Abstract
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation
model for image and video understanding. The core design philosophy of
VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the
vision-centric training paradigm and vision-centric framework design. The key
insight of our vision-centric training paradigm is that high-quality image-text
data is crucial for both image and video understanding. Instead of preparing
massive video-text datasets, we focus on constructing large-scale and
high-quality image-text datasets. VideoLLaMA3 has four training stages: 1)
vision-centric alignment stage, which warms up the vision encoder and
projector; 2) vision-language pretraining stage, which jointly tunes the vision
encoder, projector, and LLM with large-scale image-text data covering multiple
types (including scene images, documents, and charts) as well as text-only data; 3)
multi-task fine-tuning stage, which incorporates image-text SFT data for
downstream tasks and video-text data to establish a foundation for video
understanding; and 4) video-centric fine-tuning stage, which further improves the model's
capability in video understanding. As for the framework design, to better
capture fine-grained details in images, the pretrained vision encoder is
adapted to encode images of varying sizes into vision tokens with corresponding
numbers, rather than a fixed number of tokens. For video inputs, we reduce the
number of vision tokens according to their similarity so that the
representation of videos will be more precise and compact. Benefiting from these
vision-centric designs, VideoLLaMA3 achieves compelling performance on both
image and video understanding benchmarks.
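
To make the staged training recipe concrete, here is a minimal configuration sketch of the four stages as described in the abstract. The stage names follow the abstract, but the dictionary structure and the trainable-module lists for stages 3 and 4 are illustrative assumptions, not details confirmed by the paper.

```python
# Hedged sketch of the four-stage training schedule summarized in the abstract.
# Which modules are unfrozen in stages 3 and 4 is an assumption for illustration.
TRAINING_STAGES = [
    {"name": "vision-centric alignment",           # warms up encoder + projector
     "trainable": ["vision_encoder", "projector"],
     "data": ["image-text"]},
    {"name": "vision-language pretraining",        # joint tuning on large-scale data
     "trainable": ["vision_encoder", "projector", "llm"],
     "data": ["scene images", "documents", "charts", "text-only"]},
    {"name": "multi-task fine-tuning",             # trainable modules assumed
     "trainable": ["vision_encoder", "projector", "llm"],
     "data": ["image-text SFT", "video-text"]},
    {"name": "video-centric fine-tuning",          # trainable modules assumed
     "trainable": ["vision_encoder", "projector", "llm"],
     "data": ["video-text"]},
]
```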
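The framework design centers on encoding images of varying sizes into a proportional number of vision tokens rather than a fixed count. Below is a minimal sketch of that idea, assuming a ViT-style convolutional patch embedding; the class name `AnyResVisionEncoder`, the patch size, and the embedding dimension are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of variable-resolution vision tokenization (illustrative only).
import torch
import torch.nn as nn


class AnyResVisionEncoder(nn.Module):
    """Encodes an image of arbitrary H x W into (H/p) * (W/p) vision tokens,
    instead of resizing every image to a fixed grid."""

    def __init__(self, patch_size: int = 14, embed_dim: int = 1024):
        super().__init__()
        # Convolutional patch embedding: one token per non-overlapping patch.
        self.patch_embed = nn.Conv2d(3, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (1, 3, H, W) with H and W multiples of patch_size
        x = self.patch_embed(image)            # (1, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (1, num_tokens, D)


encoder = AnyResVisionEncoder()
small = torch.randn(1, 3, 224, 224)   # -> 16 * 16 = 256 tokens
large = torch.randn(1, 3, 448, 672)   # -> 32 * 48 = 1536 tokens
print(encoder(small).shape, encoder(large).shape)
```

The larger image simply yields more tokens, so fine-grained detail is preserved without cropping or aggressive downscaling.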
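For video inputs, the abstract describes reducing the number of vision tokens according to their similarity. The sketch below shows one plausible realization of that description, dropping tokens that are nearly identical to the corresponding token of the previous frame; the per-token cosine comparison and the threshold value are assumptions, not the paper's published procedure.

```python
# Hedged sketch of similarity-based video token reduction (assumed scheme).
import torch
import torch.nn.functional as F


def prune_redundant_tokens(frame_tokens: torch.Tensor, threshold: float = 0.9):
    """frame_tokens: (T, N, D) vision tokens for T frames with N tokens each.
    Keeps the first frame in full; for later frames, keeps only tokens whose
    cosine similarity to the same spatial token in the previous frame is below
    the threshold, i.e. drops near-duplicate, static content."""
    kept = [frame_tokens[0]]                       # first frame kept entirely
    for t in range(1, frame_tokens.shape[0]):
        sim = F.cosine_similarity(frame_tokens[t], frame_tokens[t - 1], dim=-1)
        kept.append(frame_tokens[t][sim < threshold])   # keep only changed tokens
    return kept                                    # list of (N_t, D) tensors


tokens = torch.randn(8, 256, 1024)                 # 8 frames, 256 tokens each
pruned = prune_redundant_tokens(tokens)
print([p.shape[0] for p in pruned])                # token count per frame
```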