Apollo: An Exploration of Video Understanding in Large Multimodal Models

December 13, 2024
Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
cs.AI

Abstract

Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and which vision encoders are the best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on Video-MME.
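The abstract's claim that fps sampling beats uniform frame sampling can be made concrete with a small sketch. This is not the paper's implementation; the function names and parameters below are illustrative. Uniform sampling picks a fixed number of frames regardless of duration, so a one-hour video is sampled far more sparsely in time than a one-minute clip; fps sampling instead keeps the temporal rate constant.

```python
def uniform_sample(total_frames: int, n: int) -> list[int]:
    """Pick n frame indices evenly spaced across the video.
    Temporal granularity degrades as the video gets longer."""
    if n >= total_frames:
        return list(range(total_frames))
    step = total_frames / n
    return [int(i * step) for i in range(n)]

def fps_sample(total_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices at a fixed temporal rate (target_fps),
    so short and long videos see motion at the same granularity."""
    step = video_fps / target_fps
    count = int(total_frames / step)
    return [int(i * step) for i in range(count)]

# A 10-second clip at 30 fps: uniform sampling of 4 frames spans the
# whole clip coarsely, while fps sampling at 1 fps yields one frame
# per second of video.
print(uniform_sample(300, 4))       # 4 indices across 300 frames
print(fps_sample(300, 30.0, 1.0))   # one index per second
```

Under fps sampling, the number of sampled frames grows with video length, which is why the paper pairs it with efficient token handling to keep hour-long videos tractable.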
