SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

Abstract

Attention is the cornerstone of modern Large Language Models (LLMs). Yet its quadratic complexity limits the efficiency and scalability of LLMs, especially those with long context windows. A promising approach to addressing this limitation is to leverage the sparsity in attention. However, existing sparsity-based solutions predominantly rely on predefined patterns or heuristics to approximate sparsity. This practice falls short of fully capturing the dynamic nature of attention sparsity in language-based tasks. This paper argues that attention sparsity should be learned rather than predefined. To this end, we design SeerAttention, a new attention mechanism that augments conventional attention with a learnable gate that adaptively selects significant blocks in an attention map and deems the remaining blocks sparse. Such block-level sparsity effectively balances accuracy and speedup. To enable efficient learning of the gating network, we develop a customized FlashAttention implementation that extracts the block-level ground truth of the attention map with minimal overhead. SeerAttention not only applies to post-training, but also excels in long-context fine-tuning. Our results show that at post-training stages, SeerAttention significantly outperforms state-of-the-art static or heuristic-based sparse attention methods, while also being more versatile and flexible in adapting to varying context lengths and sparsity ratios. When applied to long-context fine-tuning with YaRN, SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2.
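To make the idea of gated block-level sparsity concrete, the following is a minimal NumPy sketch, not the SeerAttention implementation. Everything in it is an assumption made for illustration: the function name `block_sparse_attention`, the parameters `block_size` and `keep_ratio`, the use of mean pooling as an untrained stand-in for the learnable gate, and the top-k selection per query block. It is non-causal, and it computes the full attention map before masking, so it shows only which blocks would be kept, not the speedup.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, keep_ratio):
    """Attention restricted to the key blocks that a gate scores highest.

    Illustrative sketch only; names and gate design are hypothetical.
    Assumes the sequence length is divisible by block_size.
    """
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size

    # Stand-in gate (assumption): mean-pool Q and K per block and score
    # every (query block, key block) pair. A learned gate would apply
    # trainable projections at this point.
    q_pooled = q.reshape(n_blocks, block_size, dim).mean(axis=1)
    k_pooled = k.reshape(n_blocks, block_size, dim).mean(axis=1)
    gate_scores = q_pooled @ k_pooled.T  # shape (n_blocks, n_blocks)

    # Keep the highest-scoring key blocks for each query block.
    n_keep = max(1, int(round(keep_ratio * n_blocks)))
    top = np.argsort(-gate_scores, axis=1)[:, :n_keep]
    block_mask = np.zeros((n_blocks, n_blocks), dtype=bool)
    np.put_along_axis(block_mask, top, True, axis=1)

    # Expand the block mask to token resolution.
    token_mask = block_mask.repeat(block_size, axis=0).repeat(block_size, axis=1)

    # Masked softmax attention. A real sparse kernel would skip the
    # masked blocks entirely instead of computing and discarding them.
    logits = (q @ k.T) / np.sqrt(dim)
    logits = np.where(token_mask, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, block_mask
```

With ten blocks per row, `keep_ratio=0.1` keeps one key block per query block, which corresponds to a 90% block sparsity ratio in this toy setting; `keep_ratio=1.0` reduces to ordinary dense attention.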
