An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
April 21, 2025
Authors: Ji Qi, Yuan Yao, Yushi Bai, Bin Xu, Juanzi Li, Zhiyuan Liu, Tat-Seng Chua
cs.AI
Abstract
Large Multimodal Models (LMMs) perceive video frames uniformly, which is
computationally inefficient for videos whose temporal information density
inherently varies. This paper presents Quicksviewer, an LMM with a new
perceiving paradigm that partitions a video of nonuniform density into varying
cubes using Gumbel Softmax, followed by a unified resampling of each cube to
achieve efficient video understanding. This simple and intuitive approach
dynamically compresses video online based on its temporal density, significantly
reducing spatiotemporal redundancy (an overall 45× compression rate) while
enabling efficient training with a large receptive field. We train the model from
a language backbone through three progressive stages, each incorporating
lengthy videos averaging 420s at 1fps thanks to the perceiving efficiency.
With only 0.8M total video-text samples for training, our model outperforms the
direct baseline employing a fixed partitioning strategy by up to 8.72 in
accuracy, demonstrating its effectiveness. On Video-MME,
Quicksviewer achieves SOTA under modest sequence lengths using at most 5%
of the tokens per frame required by baselines. With this paradigm, scaling up the
number of input frames reveals a clear power law of the model capabilities. It
is also empirically verified that the segments generated by the cubing network
can help analyze continuous events in videos.
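
To make the partition-then-resample idea in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' cubing network: the function names, the per-frame boundary scores, the hard top-k choice of cut points, and the mean-pooling "resampler" are all assumptions made for illustration. It only shows how Gumbel noise plus a temperature softmax can turn boundary scores into a stochastic split of frames into contiguous cubes, each of which is then reduced to a fixed-size representation.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform; clamp u away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def partition_into_cubes(boundary_logits, num_cubes, tau=1.0, rng=random):
    """Split frames into `num_cubes` contiguous (start, end) spans, end exclusive.

    boundary_logits[i] is a hypothetical score for cutting after frame i,
    so N frames have N - 1 candidate cuts.
    """
    probs = gumbel_softmax(boundary_logits, tau, rng)
    # Assumption: keep the (num_cubes - 1) most probable cut positions.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    cuts = sorted(ranked[: num_cubes - 1])
    spans, start = [], 0
    for cut in cuts:
        spans.append((start, cut + 1))
        start = cut + 1
    spans.append((start, len(boundary_logits) + 1))
    return spans


def resample_cube(frame_features, span):
    """Stand-in resampler: mean-pool the frames of one cube into one vector."""
    start, end = span
    frames = frame_features[start:end]
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


# Toy usage: 8 frames, 7 candidate cuts, 3 cubes (all values are made up).
rng = random.Random(0)
logits = [0.1, 2.0, 0.0, 0.3, 1.5, 0.2, 0.1]
spans = partition_into_cubes(logits, num_cubes=3, tau=0.5, rng=rng)
features = [[float(i), 1.0] for i in range(8)]
tokens = [resample_cube(features, s) for s in spans]
```

In this sketch, cubes covering many frames are compressed more aggressively than short ones, because every cube yields the same number of output vectors regardless of its length. A trainable version would keep the relaxed probabilities differentiable rather than taking a hard top-k, and would use a learned resampler instead of mean pooling.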