SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
November 17, 2024
Authors: Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen
cs.AI
Abstract
Although quantization for linear layers has been widely used, its application
to accelerating the attention process remains limited. SageAttention utilizes
8-bit matrix multiplication, 16-bit matrix multiplication with a 16-bit
accumulator, and precision-enhancing methods, implementing a kernel that is
accurate and 2x faster than FlashAttention2. To further improve the efficiency
of attention computation while maintaining precision, we propose
SageAttention2, which utilizes significantly faster 4-bit matrix multiplication
(Matmul) alongside additional precision-enhancing techniques. First, we propose
to quantize the matrices (Q, K) to INT4 at a warp-level granularity and to
quantize the matrices (P̃, V) to FP8. Second, we propose a method to smooth Q
and V, enhancing the accuracy of attention with INT4 QK and FP8 PV.
Third, we analyze quantization accuracy across timesteps and layers, and then
propose an adaptive quantization method to preserve end-to-end metrics across
various models. The operations per second (OPS) of SageAttention2 surpass
FlashAttention2 and xformers by about 3x and 5x on RTX4090, respectively.
Comprehensive experiments confirm that our approach incurs negligible
end-to-end metric loss across diverse models, including large language models
and image and video generation models. The code is available at
https://github.com/thu-ml/SageAttention.
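
For intuition on the smoothing step mentioned in the abstract (subtracting a mean from Q and from V and compensating for it exactly), the following is a minimal PyTorch sketch. It is not the authors' CUDA kernel: it assumes per-tensor symmetric INT4 scales, a single head, and FP32 arithmetic in place of the real FP8 PV path, so it only illustrates why the compensation terms are exact and why smoothing helps low-bit quantization.

```python
import torch

def quantize_int4(x):
    """Symmetric per-tensor quantization to the INT4 range [-7, 7]."""
    scale = x.abs().amax() / 7.0 + 1e-8
    q = torch.clamp(torch.round(x / scale), -7, 7)
    return q, scale

def smoothed_qk(Q, K):
    """Q @ K^T with low-bit inputs, after smoothing Q.

    Subtracting the per-channel mean of Q (over tokens) shrinks its range
    before quantization; the removed part, mean(Q) @ K^T, is a single row
    vector added back in full precision, so S is recovered exactly up to
    the quantization error on the smoothed operands.
    """
    q_mean = Q.mean(dim=0, keepdim=True)              # (1, d)
    Qq, sq = quantize_int4(Q - q_mean)
    Kq, sk = quantize_int4(K)
    return (Qq @ Kq.T) * (sq * sk) + q_mean @ K.T

def smoothed_pv(P, V):
    """P @ V with V smoothed by its per-channel mean.

    Rows of P sum to 1 (softmax output), so P @ v_mean equals v_mean and
    the subtracted mean is restored exactly after the low-precision matmul
    (FP8 in the real kernel; plain FP32 here).
    """
    v_mean = V.mean(dim=0, keepdim=True)              # (1, d)
    return P @ (V - v_mean) + v_mean

# Toy check: PV smoothing is lossless; QK smoothing shrinks INT4 error
# when the operands have a nonzero channel mean.
torch.manual_seed(0)
Q, K, V = (torch.randn(128, 64) + 2.0 for _ in range(3))
S_ref = Q @ K.T
P = torch.softmax(S_ref / 64 ** 0.5, dim=-1)
print((smoothed_pv(P, V) - P @ V).abs().max())        # ~1e-6 (float error only)

Qq, sq = quantize_int4(Q)
Kq, sk = quantize_int4(K)
plain_err = ((Qq @ Kq.T) * (sq * sk) - S_ref).abs().mean()
smooth_err = (smoothed_qk(Q, K) - S_ref).abs().mean()
print(plain_err.item(), smooth_err.item())            # smoothed error is smaller
```

The actual kernel applies these ideas at warp-level granularity inside a FlashAttention-style tiled loop; the per-tensor scales above are only a simplification for readability.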
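Since the abstract points to the released code, typical drop-in usage looks roughly like the snippet below. The import path `sageattention`, the function name `sageattn`, and its arguments are assumptions about the repository's Python API (they may differ by version), and the kernel requires a supported CUDA GPU; consult the repository's README for the exact interface.

```python
# Hypothetical drop-in replacement for scaled dot-product attention using the
# released package; names and arguments are assumptions, see the README.
import torch
from sageattention import sageattn  # assumed import path

# (batch, heads, seq_len, head_dim), half precision on a CUDA device
q = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")

out = sageattn(q, k, v, is_causal=False)  # output has the same shape as q
```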