SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
October 3, 2024
Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen
cs.AI
Abstract
The transformer architecture predominates across various models. As the heart
of the transformer, attention has a computational complexity of O(N^2),
compared to O(N) for linear transformations. When handling large sequence
lengths, attention becomes the primary time-consuming component. Although
quantization has proven to be an effective method for accelerating model
inference, existing quantization methods primarily focus on optimizing the
linear layer. In response, we first analyze the feasibility of quantization in
attention in detail. Following that, we propose SageAttention, a highly
efficient and accurate quantization method for attention. The OPS (operations
per second) of our approach outperforms FlashAttention2 and xformers by about
2.1 times and 2.7 times, respectively. SageAttention also achieves superior
accuracy performance over FlashAttention3. Comprehensive experiments confirm
that our approach incurs almost no end-to-end metrics loss across diverse
models, including those for large language processing, image generation, and
video generation.
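To make the core idea concrete, the following is a minimal PyTorch sketch of 8-bit quantized attention: Q and K are quantized to INT8 with symmetric per-tensor scales, the QK^T product is dequantized before the softmax, and the softmax(QK^T)V product stays in higher precision. This is only an illustration under an assumed per-tensor quantization granularity, not SageAttention's fused GPU kernel, and the helper names (quantize_int8, int8_attention) are hypothetical.

```python
# Illustrative sketch only; not the authors' fused CUDA kernel.
import torch


def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: returns (int8 tensor, scale)."""
    scale = x.abs().amax().float() / 127.0
    q = torch.clamp(torch.round(x.float() / scale), -127, 127).to(torch.int8)
    return q, scale


def int8_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (batch, heads, seq_len, head_dim) tensors in FP16/FP32."""
    head_dim = q.shape[-1]
    q_i8, q_scale = quantize_int8(q)
    k_i8, k_scale = quantize_int8(k)
    # The INT8 matmul is emulated in float32 here for portability; real kernels
    # run it on INT8 tensor cores, which is where the speedup comes from.
    scores = torch.matmul(q_i8.float(), k_i8.float().transpose(-1, -2))
    scores = scores * (q_scale * k_scale) / head_dim ** 0.5  # dequantize + scale
    p = torch.softmax(scores, dim=-1)
    # Keep the softmax(QK^T)V product in full precision in this sketch.
    return torch.matmul(p, v.float()).to(v.dtype)


# Usage: same call shape as standard scaled-dot-product attention.
q = torch.randn(1, 8, 1024, 64, dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, dtype=torch.float16)
out = int8_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```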