Apollo: An Exploration of Video Understanding in Large Multimodal Models

December 13, 2024
Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
cs.AI

Abstract

Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrate that fps sampling during training is vastly preferable to uniform frame sampling and identify which vision encoders are best suited for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art among 7B LMMs, with 70.9 on MLVU and 63.3 on Video-MME.
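The abstract contrasts fps sampling (selecting frames at a fixed temporal rate, so longer videos yield more frames) with uniform frame sampling (selecting a fixed number of frames spread evenly across the clip, so temporal density varies with length). The sketch below illustrates the difference in how frame indices are chosen; the function names and parameters are our own illustrations, not code from the paper.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick a fixed number of indices evenly spaced across the video.

    Temporal spacing between picked frames grows with video length.
    """
    if num_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


def fps_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick indices at a fixed temporal rate (target_fps).

    The number of picked frames scales with video duration, so motion is
    sampled at a constant density regardless of clip length.
    """
    stride = native_fps / target_fps  # source frames per sampled frame
    count = int(total_frames / stride)
    return [int(i * stride) for i in range(count)]
```

For a 10-second clip at 30 fps (300 frames), uniform sampling with `num_frames=8` always returns 8 frames, whereas fps sampling at 2 fps returns 20 frames; for a one-hour video the uniform sampler still returns 8, while the fps sampler's count grows with duration (in practice capped by a token budget).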
