SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
October 3, 2024
Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen
cs.AI
Abstract
The transformer architecture predominates across various models. As the heart
of the transformer, attention has a computational complexity of O(N^2),
compared to O(N) for linear transformations. When handling large sequence
lengths, attention becomes the primary time-consuming component. Although
quantization has proven to be an effective method for accelerating model
inference, existing quantization methods primarily focus on optimizing the
linear layer. In response, we first analyze the feasibility of quantization in
attention in detail. Following that, we propose SageAttention, a highly
efficient and accurate quantization method for attention. The OPS (operations
per second) of our approach outperforms FlashAttention2 and xformers by about
2.1 times and 2.7 times, respectively. SageAttention also achieves superior
accuracy performance over FlashAttention3. Comprehensive experiments confirm
that our approach incurs almost no end-to-end metrics loss across diverse
models, including those for large language processing, image generation, and
video generation.
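To make the core idea concrete, the following is a minimal PyTorch sketch of 8-bit quantized attention: Q and K are quantized to INT8 with symmetric per-tensor scales, the QK^T product is dequantized before the softmax, and the softmax(QK^T)V product stays in higher precision. This is only an illustration under an assumed per-tensor quantization granularity, not SageAttention's fused GPU kernel, and the helper names (quantize_int8, int8_attention) are hypothetical.

```python
# Illustrative sketch only; not the authors' fused CUDA kernel.
import torch


def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: returns (int8 tensor, scale)."""
    scale = x.abs().amax().float() / 127.0
    q = torch.clamp(torch.round(x.float() / scale), -127, 127).to(torch.int8)
    return q, scale


def int8_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (batch, heads, seq_len, head_dim) tensors in FP16/FP32."""
    head_dim = q.shape[-1]
    q_i8, q_scale = quantize_int8(q)
    k_i8, k_scale = quantize_int8(k)
    # The INT8 matmul is emulated in float32 here for portability; real kernels
    # run it on INT8 tensor cores, which is where the speedup comes from.
    scores = torch.matmul(q_i8.float(), k_i8.float().transpose(-1, -2))
    scores = scores * (q_scale * k_scale) / head_dim ** 0.5  # dequantize + scale
    p = torch.softmax(scores, dim=-1)
    # Keep the softmax(QK^T)V product in full precision in this sketch.
    return torch.matmul(p, v.float()).to(v.dtype)


# Usage: same call shape as standard scaled-dot-product attention.
q = torch.randn(1, 8, 1024, 64, dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, dtype=torch.float16)
out = int8_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```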