SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
February 25, 2025
Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
cs.AI
Abstract
An efficient attention implementation is essential for large models due to
its quadratic time complexity. Fortunately, attention commonly exhibits
sparsity, i.e., many values in the attention map are near zero, allowing for
the omission of corresponding computations. Many studies have utilized the
sparse pattern to accelerate attention. However, most existing works focus on
optimizing attention within specific models by exploiting certain sparse
patterns of the attention map. A universal sparse attention that guarantees
both the speedup and end-to-end performance of diverse models remains elusive.
In this paper, we propose SpargeAttn, a universal sparse and quantized
attention for any model. Our method uses a two-stage online filter: in the
first stage, we rapidly and accurately predict the attention map, enabling the
skip of some matrix multiplications in attention. In the second stage, we
design an online softmax-aware filter that incurs no extra overhead and further
skips some matrix multiplications. Experiments show that our method
significantly accelerates diverse models, including language, image, and video
generation, without sacrificing end-to-end metrics. The codes are available at
https://github.com/thu-ml/SpargeAttn.
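To make the two-stage idea concrete, below is a minimal block-sparse attention sketch, not the authors' implementation. The block size, the `keep_frac` and `skip_thresh` thresholds, and the block-mean prediction heuristic are illustrative assumptions, and SpargeAttn's quantization component is omitted entirely; the real fused CUDA kernels live in the repository above.

```python
import torch

def sparse_block_attention(q, k, v, block=64, keep_frac=0.1, skip_thresh=1e-4):
    """Illustrative two-stage block-skipping attention.

    q, k, v: (seq_len, head_dim) tensors; seq_len is assumed divisible by block.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.empty_like(q)

    # Stage 1 (illustrative): predict a coarse attention map from block-mean
    # queries/keys, and keep only the K/V blocks whose predicted weight is
    # within keep_frac of each row's maximum.
    qb = q.view(n // block, block, d).mean(dim=1)
    kb = k.view(n // block, block, d).mean(dim=1)
    pred = torch.softmax(qb @ kb.T * scale, dim=-1)
    keep = pred >= keep_frac * pred.max(dim=-1, keepdim=True).values

    for i in range(n // block):
        qi = q[i * block:(i + 1) * block] * scale
        m = torch.full((block, 1), float("-inf"), dtype=q.dtype, device=q.device)
        l = torch.zeros(block, 1, dtype=q.dtype, device=q.device)
        acc = torch.zeros(block, d, dtype=q.dtype, device=q.device)
        for j in range(n // block):
            if not keep[i, j]:
                continue  # stage-1 skip: neither QK^T nor PV is computed for this tile
            kj = k[j * block:(j + 1) * block]
            vj = v[j * block:(j + 1) * block]
            s = qi @ kj.T  # (block, block) tile of QK^T
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)
            # Stage 2 (illustrative): under the running online-softmax maximum,
            # a tile whose exponentiated scores are all tiny contributes almost
            # nothing, so its PV product can be skipped.
            if p.max() < skip_thresh:
                continue
            alpha = torch.exp(m - m_new)  # rescale the previous accumulator
            acc = acc * alpha + p @ vj
            l = l * alpha + p.sum(dim=-1, keepdim=True)
            m = m_new
        out[i * block:(i + 1) * block] = acc / l
    return out
```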