Apollo: An Exploration of Video Understanding in Large Multimodal Models
December 13, 2024
Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
cs.AI
Abstract
Despite the rapid integration of video perception capabilities into Large
Multimodal Models (LMMs), the underlying mechanisms driving their video
understanding remain poorly understood. Consequently, many design decisions in
this domain are made without proper justification or analysis. The high
computational cost of training and evaluating such models, coupled with limited
open research, hinders the development of video-LMMs. To address this, we
present a comprehensive study that helps uncover what effectively drives video
understanding in LMMs.
We begin by critically examining the primary contributors to the high
computational requirements associated with video-LMM research and discover
Scaling Consistency, wherein design and training decisions made on smaller
models and datasets (up to a critical size) effectively transfer to larger
models. Leveraging these insights, we explore many video-specific aspects of
video-LMMs, including video sampling, architectures, data composition, training
schedules, and more. For example, we demonstrate that fps sampling during
training is vastly preferable to uniform frame sampling, and we identify which
vision encoders are best suited for video representation.
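The difference between the two sampling strategies can be made concrete with a minimal sketch (this is an illustration of the general technique, not Apollo's actual implementation; the function names and parameters are hypothetical). Uniform sampling fixes the number of frames, so the time gap between frames stretches with video length; fps sampling fixes the temporal rate, so the model sees a consistent interval between frames regardless of duration.

```python
def uniform_sample(total_frames: int, num_frames: int) -> list[int]:
    """Pick a fixed number of frames, evenly spaced over the whole video.
    The effective temporal rate varies with video length."""
    if num_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


def fps_sample(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick frames at a fixed temporal rate (frames per second), so the
    interval between sampled frames is constant regardless of duration."""
    stride = max(1, round(native_fps / target_fps))
    return list(range(0, total_frames, stride))


# A 10-second clip at 30 fps (300 frames):
print(uniform_sample(300, 8))      # 8 frames spread over the clip
print(fps_sample(300, 30.0, 2.0))  # one frame every 0.5 s -> 20 frames
```

For a one-hour video the uniform sampler would space its 8 frames minutes apart, while the fps sampler keeps the same 0.5 s spacing and simply returns more frames, which is the property the abstract's finding favors during training.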
Guided by these findings, we introduce Apollo, a state-of-the-art family of
LMMs that achieve superior performance across different model sizes. Our models
can perceive hour-long videos efficiently, with Apollo-3B outperforming most
existing 7B models, scoring an impressive 55.1 on LongVideoBench. Apollo-7B is
state-of-the-art among 7B LMMs, with 70.9 on MLVU and 63.3 on Video-MME.