Apollo: An Exploration of Video Understanding in Large Multimodal Models

Abstract

Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements of video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, and training schedules. For example, we demonstrate that fps sampling during training is vastly preferable to uniform frame sampling, and we identify which vision encoders are best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently: Apollo-3B outperforms most existing 7B models, scoring 55.1 on LongVideoBench, and Apollo-7B is state-of-the-art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME.
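
The distinction between fps sampling and uniform frame sampling mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation; the function names, the choice of segment midpoints for the uniform case, and the example rates (30 fps source, 2 fps target, 8 uniform frames) are all assumptions made for illustration. The point it shows is that uniform sampling fixes the number of frames, so the time gap between frames grows with video length, while fps sampling fixes the time gap, so the number of frames grows instead.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick a fixed number of frames spread evenly over the whole video.

    Illustrative helper (hypothetical name): takes the midpoint of each of
    num_samples equal-length segments.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def fps_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick frames at a fixed temporal rate, whatever the video length.

    Illustrative helper (hypothetical name): one frame every
    native_fps / target_fps source frames, starting at frame 0.
    """
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    stride = native_fps / target_fps
    indices = []
    k = 0
    while int(k * stride) < total_frames:
        indices.append(int(k * stride))
        k += 1
    return indices


if __name__ == "__main__":
    # Assumed example: a 10 s clip and a 60 s clip, both recorded at 30 fps.
    for seconds in (10, 60):
        total = seconds * 30
        uniform = uniform_frame_indices(total, num_samples=8)
        by_fps = fps_frame_indices(total, native_fps=30.0, target_fps=2.0)
        print(f"{seconds:>2} s clip: uniform -> {len(uniform)} frames, "
              f"fps -> {len(by_fps)} frames")
```

Under these assumptions, the uniform sampler always returns 8 frames, so consecutive frames are about 1.25 s apart in the 10 s clip but about 7.5 s apart in the 60 s clip. The fps sampler keeps frames 0.5 s apart in both clips and returns more frames for the longer one.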
