Token-Efficient Long Video Understanding for Multimodal LLMs

March 6, 2025
Authors: Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
cs.AI

Abstract

Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing computation costs by up to 8× and decoding latency by 2.4-2.9× for a fixed number of input frames. Project page: https://research.nvidia.com/labs/lpr/storm
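
To make the token-reduction idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: a `TemporalMixer` (a GRU used purely as a stand-in for the paper's Mamba-based temporal encoder) injects cross-frame context into per-frame image tokens, and `temporal_pool` then averages tokens over groups of consecutive frames, cutting the visual-token count the LLM must process. All names, shapes, and the pooling factor here are illustrative assumptions.

```python
# Hypothetical sketch of STORM-style temporal token reduction (not the
# authors' code). A lightweight temporal mixer stands in for the paper's
# Mamba-based temporal encoder; pooling then shrinks the token count
# fed to the LLM.
import torch
import torch.nn as nn

class TemporalMixer(nn.Module):
    """Placeholder for the Mamba temporal encoder: mixes information
    across frames at each spatial token position."""
    def __init__(self, dim: int):
        super().__init__()
        # A GRU over the time axis, used purely for illustration.
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T frames, N tokens per frame, dim)
        b, t, n, d = x.shape
        # Treat each spatial position as an independent temporal sequence.
        x = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        x, _ = self.rnn(x)
        return x.reshape(b, n, t, d).permute(0, 2, 1, 3)

def temporal_pool(x: torch.Tensor, factor: int) -> torch.Tensor:
    """Average tokens over groups of `factor` consecutive frames,
    reducing the LLM's visual-token count by that factor."""
    b, t, n, d = x.shape
    assert t % factor == 0, "frame count must be divisible by the pool factor"
    return x.reshape(b, t // factor, factor, n, d).mean(dim=2)

if __name__ == "__main__":
    # 32 frames x 196 tokens/frame = 6272 visual tokens without reduction.
    tokens = torch.randn(1, 32, 196, 1024)
    mixed = TemporalMixer(1024)(tokens)       # inject temporal context
    reduced = temporal_pool(mixed, factor=4)  # 8 pooled "frames" x 196 tokens
    print(reduced.shape)                      # torch.Size([1, 8, 196, 1024])
```

The ordering mirrors the abstract's argument: because temporal information has already been mixed into every token before pooling, averaging frames away discards redundancy rather than unique temporal content.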
