An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

April 21, 2025
Authors: Ji Qi, Yuan Yao, Yushi Bai, Bin Xu, Juanzi Li, Zhiyuan Liu, Tat-Seng Chua
cs.AI

Abstract

Large Multimodal Models (LMMs) typically perceive video frames uniformly, which is computationally inefficient for videos whose temporal information density inherently varies. This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into cubes of varying length using Gumbel Softmax, then applies a unified resampling to each cube to achieve efficient video understanding. This simple and intuitive approach dynamically compresses video online based on its temporal density, significantly reducing spatiotemporal redundancy (an overall 45× compression rate) while enabling efficient training with a large receptive field. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420 s at 1 fps thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline that employs a fixed partitioning strategy by up to 8.72 in accuracy. On Video-MME, Quicksviewer achieves SOTA under modest sequence lengths while using at most 5% of the tokens per frame required by baselines. With this paradigm, scaling up the number of input frames reveals a clear power law in the model's capabilities. It is also empirically verified that the segments generated by the cubing network can help analyze continuous events in videos.
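The partition-then-resample idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the boundary score (feature change between adjacent frames), the hard top-k choice of cut points, and the mean-pooling resampler are all assumptions made for illustration, and the function names `gumbel_softmax`, `partition_into_cubes`, and `resample_cube` are hypothetical. It only shows the forward pass; a trainable version would need a differentiable (for example, straight-through) selection.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Relaxed sample: softmax((logits + Gumbel noise) / tau)."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))          # standard Gumbel(0, 1) noise
    z = (logits + g) / tau
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def partition_into_cubes(frames, n_cubes, tau=1.0, seed=0):
    """Split T frame features into n_cubes contiguous, variable-length cubes."""
    if not 1 <= n_cubes <= len(frames):
        raise ValueError("n_cubes must be between 1 and the number of frames")
    if n_cubes == 1:
        return [frames]
    rng = np.random.default_rng(seed)
    # Assumed boundary score: how much the features change between
    # adjacent frames (one score per possible cut, shape (T - 1,)).
    logits = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    probs = gumbel_softmax(logits, tau, rng)
    # Hard selection of the n_cubes - 1 most probable cut points.
    cuts = np.sort(np.argsort(probs)[-(n_cubes - 1):]) + 1
    return np.split(frames, cuts)

def resample_cube(cube, n_tokens):
    """Stand-in resampler: mean-pool a cube's frames into n_tokens vectors."""
    chunks = np.array_split(cube, n_tokens)
    # Cubes shorter than n_tokens yield empty chunks; fall back to the cube mean.
    return np.stack([c.mean(axis=0) if len(c) else cube.mean(axis=0)
                     for c in chunks])

# Usage: 32 frames with 8-dim features, 4 cubes, 2 tokens per cube.
frames = np.random.default_rng(1).normal(size=(32, 8))
cubes = partition_into_cubes(frames, n_cubes=4)
tokens = np.concatenate([resample_cube(c, n_tokens=2) for c in cubes])
```

Every cube is reduced to the same number of tokens regardless of how many frames it spans, so long, low-change stretches of video are compressed more heavily than short, high-change ones.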