Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
April 14, 2025
Authors: Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanxing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, Bohan Zeng, Wentao Zhang, Fuzheng Zhang, Wenjing Yang, Di Zhang
cs.AI
Abstract
Long-context video understanding in multimodal large language models (MLLMs)
faces a critical challenge: balancing computational efficiency with the
retention of fine-grained spatio-temporal patterns. Existing approaches (e.g.,
sparse sampling, dense sampling with low resolution, and token compression)
suffer from significant information loss in temporal dynamics, spatial details,
or subtle interactions, particularly in videos with complex motion or varying
resolutions. To address this, we propose Mavors, a novel framework that
introduces multi-granularity video representation for holistic long-video
modeling. Specifically, Mavors directly encodes raw video content
into latent representations through two core components: 1) an Intra-chunk
Vision Encoder (IVE) that preserves high-resolution spatial features via 3D
convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator
(IFA) that establishes temporal coherence across chunks using transformer-based
dependency modeling with chunk-level rotary position encodings. Moreover, the
framework unifies image and video understanding by treating images as
single-frame videos via sub-image decomposition. Experiments across diverse
benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity
and temporal continuity, significantly outperforming existing methods in tasks
requiring fine-grained spatio-temporal reasoning.
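
To make the idea of chunk-level rotary position encodings concrete, the following is a minimal sketch and not the authors' implementation. It assumes a video is split into fixed-size chunks of frames, that each chunk yields a fixed number of feature tokens, and that every token in a chunk shares that chunk's index as its rotary position. The helper names (`split_into_chunks`, `chunk_position_ids`, `apply_rope`), the chunk size of 16 frames, the two tokens per chunk, and the base of 10000 are all hypothetical choices for illustration; the abstract does not specify any of them.

```python
import math

def split_into_chunks(num_frames, chunk_size):
    """Return (start, end) frame ranges, one per chunk; the last may be shorter."""
    return [(s, min(s + chunk_size, num_frames))
            for s in range(0, num_frames, chunk_size)]

def chunk_position_ids(num_chunks, tokens_per_chunk):
    """Assumption: every token in chunk c uses position id c (chunk-level)."""
    return [c for c in range(num_chunks) for _ in range(tokens_per_chunk)]

def apply_rope(vec, position, base=10000.0):
    """Standard rotary encoding: rotate each pair (vec[i], vec[i+1]), i even,
    by the angle position * base**(-i / d), where d = len(vec)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical example: 40 frames, 16 frames per chunk, 2 tokens per chunk.
chunks = split_into_chunks(40, 16)
pos_ids = chunk_position_ids(len(chunks), tokens_per_chunk=2)
token = [1.0, 0.0, 1.0, 0.0]  # toy 4-dimensional feature
rotated = [apply_rope(token, p) for p in pos_ids]
```

Because the rotation angle depends only on the chunk index, tokens from the same chunk receive identical positional rotations, and the dot product between two rotated vectors depends only on the difference of their chunk indices. This is the usual relative-position property of rotary encodings, applied here at chunk rather than token granularity.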