SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
November 17, 2024
Authors: Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen
cs.AI
Abstract
Although quantization for linear layers has been widely used, its application
to accelerating the attention process remains limited. SageAttention utilizes
8-bit matrix multiplication, 16-bit matrix multiplication with a 16-bit
accumulator, and precision-enhancing methods, implementing a kernel that is
accurate and 2x faster than FlashAttention2. To further improve the efficiency
of attention computation while maintaining precision, we propose
SageAttention2, which utilizes significantly faster 4-bit matrix multiplication
(Matmul) alongside additional precision-enhancing techniques. First, we propose
to quantize the matrices (Q, K) to INT4 at a warp-level granularity and to
quantize the matrices (P̃, V) to FP8. Second, we propose a method to smooth Q
and V, enhancing the accuracy of attention with INT4 QK and FP8 PV.
Third, we analyze quantization accuracy across timesteps and layers, and then
propose an adaptive quantization method to preserve end-to-end metrics across
various models. The operations per second (OPS) of SageAttention2 surpass
FlashAttention2 and xformers by about 3x and 5x on RTX4090, respectively.
Comprehensive experiments confirm that our approach incurs negligible
end-to-end metric loss across diverse models, including large language models
and image and video generation models. The code is available at
https://github.com/thu-ml/SageAttention.
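
For intuition on the smoothing step mentioned in the abstract (subtracting a mean from Q and from V and compensating for it exactly), the following is a minimal PyTorch sketch. It is not the authors' CUDA kernel: it assumes per-tensor symmetric INT4 scales, a single head, and FP32 arithmetic in place of the real FP8 PV path, so it only illustrates why the compensation terms are exact and why smoothing helps low-bit quantization.

```python
import torch

def quantize_int4(x):
    """Symmetric per-tensor quantization to the INT4 range [-7, 7]."""
    scale = x.abs().amax() / 7.0 + 1e-8
    q = torch.clamp(torch.round(x / scale), -7, 7)
    return q, scale

def smoothed_qk(Q, K):
    """Q @ K^T with low-bit inputs, after smoothing Q.

    Subtracting the per-channel mean of Q (over tokens) shrinks its range
    before quantization; the removed part, mean(Q) @ K^T, is a single row
    vector added back in full precision, so S is recovered exactly up to
    the quantization error on the smoothed operands.
    """
    q_mean = Q.mean(dim=0, keepdim=True)              # (1, d)
    Qq, sq = quantize_int4(Q - q_mean)
    Kq, sk = quantize_int4(K)
    return (Qq @ Kq.T) * (sq * sk) + q_mean @ K.T

def smoothed_pv(P, V):
    """P @ V with V smoothed by its per-channel mean.

    Rows of P sum to 1 (softmax output), so P @ v_mean equals v_mean and
    the subtracted mean is restored exactly after the low-precision matmul
    (FP8 in the real kernel; plain FP32 here).
    """
    v_mean = V.mean(dim=0, keepdim=True)              # (1, d)
    return P @ (V - v_mean) + v_mean

# Toy check: PV smoothing is lossless; QK smoothing shrinks INT4 error
# when the operands have a nonzero channel mean.
torch.manual_seed(0)
Q, K, V = (torch.randn(128, 64) + 2.0 for _ in range(3))
S_ref = Q @ K.T
P = torch.softmax(S_ref / 64 ** 0.5, dim=-1)
print((smoothed_pv(P, V) - P @ V).abs().max())        # ~1e-6 (float error only)

Qq, sq = quantize_int4(Q)
Kq, sk = quantize_int4(K)
plain_err = ((Qq @ Kq.T) * (sq * sk) - S_ref).abs().mean()
smooth_err = (smoothed_qk(Q, K) - S_ref).abs().mean()
print(plain_err.item(), smooth_err.item())            # smoothed error is smaller
```

The actual kernel applies these ideas at warp-level granularity inside a FlashAttention-style tiled loop; the per-tensor scales above are only a simplification for readability.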
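Since the abstract points to the released code, typical drop-in usage looks roughly like the snippet below. The import path `sageattention`, the function name `sageattn`, and its arguments are assumptions about the repository's Python API (they may differ by version), and the kernel requires a supported CUDA GPU; consult the repository's README for the exact interface.

```python
# Hypothetical drop-in replacement for scaled dot-product attention using the
# released package; names and arguments are assumptions, see the README.
import torch
from sageattention import sageattn  # assumed import path

# (batch, heads, seq_len, head_dim), half precision on a CUDA device
q = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")

out = sageattn(q, k, v, is_causal=False)  # output has the same shape as q
```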