MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
April 22, 2025
Authors: Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
cs.AI
Abstract
The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Meanwhile, VLMs exhibit markedly different sparse distributions across modalities. We introduce a permutation-based method to exploit the unique Grid pattern and handle modality-boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modification or fine-tuning. Experiments on multi-modal benchmarks, including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH, with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
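
To make the mechanism concrete, below is a minimal PyTorch sketch, not the authors' kernels, of the two ideas the abstract describes: a stride permutation that gathers the Grid pattern into contiguous blocks, and a block-sparse attention routine that computes only the highest-scoring query-block/key-block pairs. The names `grid_permutation`, `block_sparse_attention`, `grid_stride`, and `top_k_blocks` are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch of permutation-based grid sparse attention.
# Assumptions (not from the paper): grid_stride models the spatial
# stride of video tokens; top_k_blocks approximates the dynamically
# selected sparse blocks; causal masking is omitted for brevity.
import torch
import torch.nn.functional as F


def grid_permutation(seq_len: int, grid_stride: int) -> torch.Tensor:
    # Group token indices by their phase modulo grid_stride, so tokens
    # at the same spatial offset across frames become contiguous,
    # e.g. [0, s, 2s, ..., 1, 1+s, ...].
    assert seq_len % grid_stride == 0
    return torch.arange(seq_len).view(-1, grid_stride).t().reshape(-1)


def block_sparse_attention(q, k, v, block_size=64, top_k_blocks=4):
    # Stand-in for a block-sparse GPU kernel: estimate block importance
    # from mean-pooled queries/keys, then attend only within the top-k
    # key blocks selected for each query block.
    seq_len, dim = q.shape
    nb = seq_len // block_size
    q_blk = q.view(nb, block_size, dim).mean(dim=1)
    k_blk = k.view(nb, block_size, dim).mean(dim=1)
    block_scores = q_blk @ k_blk.t() / dim ** 0.5
    keep = block_scores.topk(top_k_blocks, dim=-1).indices

    out = torch.zeros_like(q)
    for i in range(nb):
        qi = q[i * block_size:(i + 1) * block_size]
        kv_idx = torch.cat([
            torch.arange(j * block_size, (j + 1) * block_size)
            for j in keep[i].tolist()
        ])
        attn = F.softmax(qi @ k[kv_idx].t() / dim ** 0.5, dim=-1)
        out[i * block_size:(i + 1) * block_size] = attn @ v[kv_idx]
    return out


# Usage: permute Q/K/V so the grid pattern becomes block-aligned,
# attend sparsely, then invert the permutation on the output.
seq_len, dim, stride = 1024, 64, 8
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))
perm = grid_permutation(seq_len, stride)
inv = torch.empty_like(perm)
inv[perm] = torch.arange(seq_len)
out = block_sparse_attention(q[perm], k[perm], v[perm])[inv]
```

Because the same permutation is applied to queries, keys, and values and then inverted on the output, attention results are unchanged up to which blocks are computed; the permutation only rearranges the strided Grid pattern into dense blocks that a sparse kernel can skip efficiently.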