Slow-Fast Architecture for Video Multi-Modal Large Language Models
April 2, 2025
Authors: Min Shi, Shihao Wang, Chieh-Yun Chen, Jitesh Jain, Kai Wang, Junjun Xiong, Guilin Liu, Zhiding Yu, Humphrey Shi
cs.AI
Abstract
Balancing temporal resolution and spatial detail under limited compute budget
remains a key challenge for video-based multi-modal large language models
(MLLMs). Existing methods typically compress video representations using
predefined rules before feeding them into the LLM, resulting in irreversible
information loss and often ignoring input instructions. To address this, we
propose a novel slow-fast architecture that naturally circumvents this
trade-off, enabling the use of more input frames while preserving spatial
details. Inspired by how humans first skim a video before focusing on relevant
parts, our slow-fast design employs a dual-token strategy: 1) "fast" visual
tokens -- a compact set of compressed video features -- are fed into the LLM
alongside text embeddings to provide a quick overview; 2) "slow" visual tokens
-- uncompressed video features -- are cross-attended by text embeddings through
specially designed hybrid decoder layers, enabling instruction-aware extraction
of relevant visual details with linear complexity. We conduct systematic
exploration to optimize both the overall architecture and key components.
Experiments show that our model significantly outperforms self-attention-only
baselines, extending the input capacity from 16 to 128 frames with just a 3%
increase in computation, and achieving a 16% average performance improvement
across five video understanding benchmarks. Our 7B model achieves
state-of-the-art performance among models of similar size. Furthermore, our
slow-fast architecture is a plug-and-play design that can be integrated into
other video MLLMs to improve efficiency and scalability.
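As a rough illustration of the dual-token idea described in the abstract, the sketch below shows a decoder layer in which text embeddings and compressed "fast" tokens pass through ordinary self-attention, while uncompressed "slow" tokens are only consulted via cross-attention, so the added cost grows linearly with the number of slow tokens. This is a minimal sketch under assumed names and dimensions (HybridDecoderLayer, d_model, token counts, layer layout), not the authors' implementation.

```python
# Minimal, illustrative PyTorch sketch of a "hybrid" decoder layer:
# text + compressed fast tokens self-attend; uncompressed slow tokens are
# attended to only as keys/values via cross-attention (linear in T_slow).
# All names and sizes here are assumptions for illustration.
import torch
import torch.nn as nn


class HybridDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, slow_tokens: torch.Tensor) -> torch.Tensor:
        # hidden:      [B, T_text + T_fast, D] text embeddings + compressed fast tokens
        # slow_tokens: [B, T_slow, D]          uncompressed video features; they never
        #                                      attend among themselves, so cost stays
        #                                      linear in T_slow rather than quadratic
        x = self.norm1(hidden)
        hidden = hidden + self.self_attn(x, x, x, need_weights=False)[0]
        x = self.norm2(hidden)
        hidden = hidden + self.cross_attn(x, slow_tokens, slow_tokens, need_weights=False)[0]
        hidden = hidden + self.mlp(self.norm3(hidden))
        return hidden


# Toy usage: a compact fast-token overview enters the self-attention sequence,
# while many frames of slow tokens are consulted only through cross-attention.
layer = HybridDecoderLayer()
text_and_fast = torch.randn(1, 256, 1024)   # prompt tokens + compressed video overview
slow = torch.randn(1, 128 * 64, 1024)       # e.g. 128 frames x 64 tokens/frame, uncompressed
out = layer(text_and_fast, slow)
print(out.shape)  # torch.Size([1, 256, 1024])
```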