The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

April 24, 2025
Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti
cs.AI

Abstract

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks, including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) an isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.
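
To make the class of methods under study concrete, below is a minimal, illustrative sketch of training-free sparse attention at decode time, where a single query attends only to the top-k highest-scoring cached keys instead of the full context. This is a generic example for intuition only: the function names, the top-k selection rule, and the budget k are assumptions of this sketch, not the specific sparsification units or budget-adaptivity schemes evaluated in the paper.

```python
# Illustrative sketch (not the paper's method): generic "top-k" sparse
# attention for a single decode-step query over a cached KV sequence.
import numpy as np

def dense_attention(q, K, V):
    """Standard single-query attention over all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])        # (seq_len,)
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                           # (head_dim,)

def topk_sparse_attention(q, K, V, k):
    """Attend only to the k highest-scoring keys (one simple sparsity unit)."""
    scores = K @ q / np.sqrt(q.shape[-1])        # (seq_len,)
    keep = np.argpartition(scores, -k)[-k:]      # indices of the top-k keys
    sub = scores[keep]
    weights = np.exp(sub - sub.max())
    weights /= weights.sum()
    return weights @ V[keep]                     # (head_dim,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, head_dim = 4096, 64
    q = rng.standard_normal(head_dim)
    K = rng.standard_normal((seq_len, head_dim))
    V = rng.standard_normal((seq_len, head_dim))
    dense = dense_attention(q, K, V)
    sparse = topk_sparse_attention(q, K, V, k=256)
    print("max abs difference:", np.abs(dense - sparse).max())
```

Here, keeping k = 256 of 4096 cached keys corresponds to roughly 94% sparsity; the paper's findings concern how far such budgets can be pushed at different model scales and in different phases (prefilling vs. decoding) before accuracy degrades.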
