SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

October 17, 2024
Authors: Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang
cs.AI

Abstract

Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity limits the efficiency and scalability of LLMs, especially for those with a long-context window. A promising approach to addressing this limitation is to leverage the sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics to approximate sparsity. This practice falls short of fully capturing the dynamic nature of attention sparsity in language-based tasks. This paper argues that attention sparsity should be learned rather than predefined. To this end, we design SeerAttention, a new attention mechanism that augments conventional attention with a learnable gate that adaptively selects significant blocks in an attention map and deems the remaining blocks sparse. Such block-level sparsity effectively balances accuracy and speedup. To enable efficient learning of the gating network, we develop a customized FlashAttention implementation that extracts the block-level ground truth of the attention map with minimal overhead. SeerAttention not only applies to post-training, but also excels in long-context fine-tuning. Our results show that at post-training stages, SeerAttention significantly outperforms state-of-the-art static or heuristic-based sparse attention methods, while also being more versatile and flexible in adapting to varying context lengths and sparsity ratios. When applied to long-context fine-tuning with YaRN, SeerAttention achieves a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2.
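
The abstract describes a learnable gate that scores blocks of the attention map and keeps only the significant ones. As a rough illustration of that idea only, the sketch below implements a hypothetical block-level gate in PyTorch: the name BlockGate, the average-pooling of queries and keys into block summaries, the linear projections, and the top-k block selection are all illustrative assumptions, not the paper's actual design, which is trained against block-level ground truth extracted by a customized FlashAttention kernel.

```python
import torch
import torch.nn as nn


class BlockGate(nn.Module):
    """Hypothetical block-level attention gate (illustrative, not SeerAttention's exact design).

    Pools queries and keys into per-block summaries, scores every
    (query-block, key-block) pair, and keeps only the highest-scoring
    key blocks per query block; the remaining blocks are treated as sparse.
    """

    def __init__(self, head_dim: int, block_size: int = 64, keep_ratio: float = 0.1):
        super().__init__()
        self.block_size = block_size
        self.keep_ratio = keep_ratio
        # Small learnable projections operating on block summaries (assumed design).
        self.q_proj = nn.Linear(head_dim, head_dim, bias=False)
        self.k_proj = nn.Linear(head_dim, head_dim, bias=False)

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: (batch, seq_len, head_dim); seq_len assumed divisible by block_size.
        b, n, d = q.shape
        nb = n // self.block_size
        # Average-pool tokens within each block to obtain block-level summaries.
        q_blocks = q.view(b, nb, self.block_size, d).mean(dim=2)
        k_blocks = k.view(b, nb, self.block_size, d).mean(dim=2)
        # Block-level relevance scores: (batch, num_blocks, num_blocks).
        scores = self.q_proj(q_blocks) @ self.k_proj(k_blocks).transpose(-1, -2) / d ** 0.5
        # Keep the top-k key blocks for each query block; mask out the rest.
        k_keep = max(1, int(self.keep_ratio * nb))
        top_idx = scores.topk(k_keep, dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(-1, top_idx, True)
        return mask  # True = compute this attention block, False = treat as sparse.


# Usage sketch: decide which 64x64 attention blocks to compute for one head.
q = torch.randn(1, 1024, 128)
k = torch.randn(1, 1024, 128)
gate = BlockGate(head_dim=128, block_size=64, keep_ratio=0.1)
block_mask = gate(q, k)  # (1, 16, 16) boolean mask over attention blocks
```

In a real system, a block-sparse attention kernel would consume such a mask and skip the blocks marked False entirely, which is where a speedup over dense FlashAttention-2 like the one reported above would come from.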
