MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
April 22, 2025
Authors: Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
cs.AI
Abstract
The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Meanwhile, VLMs exhibit markedly different sparse distributions across modalities. We introduce a permutation-based method to exploit the unique Grid pattern and handle modality-boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modification or fine-tuning. Experiments on multi-modal benchmarks, including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH, with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
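
To make the mechanism concrete, below is a minimal PyTorch sketch, not the authors' kernels, of the two ideas the abstract describes: a stride permutation that gathers the Grid pattern into contiguous blocks, and a block-sparse attention routine that computes only the highest-scoring query-block/key-block pairs. The names `grid_permutation`, `block_sparse_attention`, `grid_stride`, and `top_k_blocks` are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch of permutation-based grid sparse attention.
# Assumptions (not from the paper): grid_stride models the spatial
# stride of video tokens; top_k_blocks approximates the dynamically
# selected sparse blocks; causal masking is omitted for brevity.
import torch
import torch.nn.functional as F


def grid_permutation(seq_len: int, grid_stride: int) -> torch.Tensor:
    # Group token indices by their phase modulo grid_stride, so tokens
    # at the same spatial offset across frames become contiguous,
    # e.g. [0, s, 2s, ..., 1, 1+s, ...].
    assert seq_len % grid_stride == 0
    return torch.arange(seq_len).view(-1, grid_stride).t().reshape(-1)


def block_sparse_attention(q, k, v, block_size=64, top_k_blocks=4):
    # Stand-in for a block-sparse GPU kernel: estimate block importance
    # from mean-pooled queries/keys, then attend only within the top-k
    # key blocks selected for each query block.
    seq_len, dim = q.shape
    nb = seq_len // block_size
    q_blk = q.view(nb, block_size, dim).mean(dim=1)
    k_blk = k.view(nb, block_size, dim).mean(dim=1)
    block_scores = q_blk @ k_blk.t() / dim ** 0.5
    keep = block_scores.topk(top_k_blocks, dim=-1).indices

    out = torch.zeros_like(q)
    for i in range(nb):
        qi = q[i * block_size:(i + 1) * block_size]
        kv_idx = torch.cat([
            torch.arange(j * block_size, (j + 1) * block_size)
            for j in keep[i].tolist()
        ])
        attn = F.softmax(qi @ k[kv_idx].t() / dim ** 0.5, dim=-1)
        out[i * block_size:(i + 1) * block_size] = attn @ v[kv_idx]
    return out


# Usage: permute Q/K/V so the grid pattern becomes block-aligned,
# attend sparsely, then invert the permutation on the output.
seq_len, dim, stride = 1024, 64, 8
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))
perm = grid_permutation(seq_len, stride)
inv = torch.empty_like(perm)
inv[perm] = torch.arange(seq_len)
out = block_sparse_attention(q[perm], k[perm], v[perm])[inv]
```

Because the same permutation is applied to queries, keys, and values and then inverted on the output, attention results are unchanged up to which blocks are computed; the permutation only rearranges the strided Grid pattern into dense blocks that a sparse kernel can skip efficiently.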