Apollo: An Exploration of Video Understanding in Large Multimodal Models
December 13, 2024
Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
cs.AI
Abstract
Despite the rapid integration of video perception capabilities into Large
Multimodal Models (LMMs), the underlying mechanisms driving their video
understanding remain poorly understood. Consequently, many design decisions in
this domain are made without proper justification or analysis. The high
computational cost of training and evaluating such models, coupled with limited
open research, hinders the development of video-LMMs. To address this, we
present a comprehensive study that helps uncover what effectively drives video
understanding in LMMs.
We begin by critically examining the primary contributors to the high
computational requirements associated with video-LMM research and discover
Scaling Consistency, wherein design and training decisions made on smaller
models and datasets (up to a critical size) effectively transfer to larger
models. Leveraging these insights, we explore many video-specific aspects of
video-LMMs, including video sampling, architectures, data composition, training
schedules, and more. For example, we demonstrate that fps sampling during
training is vastly preferable to uniform frame sampling, and identify which
vision encoders are best suited for video representation.
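To make the sampling distinction concrete, here is a minimal illustrative sketch (not the paper's actual code): uniform sampling draws a fixed number of frames regardless of duration, while fps-based sampling draws frames at a fixed temporal rate, so longer videos yield proportionally more frames. The function names and parameters below are hypothetical.

```python
def uniform_sample(num_video_frames: int, num_samples: int) -> list[int]:
    """Pick a fixed number of frame indices, evenly spaced,
    regardless of the video's duration."""
    if num_samples >= num_video_frames:
        return list(range(num_video_frames))
    step = num_video_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def fps_sample(num_video_frames: int, native_fps: float,
               target_fps: float) -> list[int]:
    """Pick frame indices at a fixed temporal rate (target_fps),
    so longer videos yield proportionally more frames."""
    step = native_fps / target_fps
    return [int(i * step) for i in range(int(num_video_frames / step))]

# A 10 s clip at 30 fps: uniform sampling always returns 8 frames,
# while 2-fps sampling returns 20 frames (one every 0.5 s).
print(len(uniform_sample(300, 8)))      # 8
print(len(fps_sample(300, 30.0, 2.0)))  # 20
```

The key property is that `fps_sample` preserves the temporal spacing between sampled frames across videos of different lengths, whereas `uniform_sample` stretches a fixed frame budget over the whole clip.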
Guided by these findings, we introduce Apollo, a state-of-the-art family of
LMMs that achieve superior performance across different model sizes. Our models
can perceive hour-long videos efficiently, with Apollo-3B outperforming most
existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is
state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on
Video-MME.