The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
April 24, 2025
Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti
cs.AI
Abstract
Sparse attention offers a promising strategy to extend long-context
capabilities in Transformer LLMs, yet its viability and its efficiency-accuracy
trade-offs remain underexplored, and systematic scaling studies are lacking. To address this
gap, we perform a careful comparison of training-free sparse attention methods
at varying model scales, sequence lengths, and sparsity levels on a diverse
collection of long-sequence tasks, including novel ones that rely on natural
language while remaining controllable and easy to evaluate. Based on our
experiments, we report a series of key findings: 1) An isoFLOPS analysis
reveals that for very long sequences, larger and highly sparse models are
preferable to smaller and dense ones. 2) The level of sparsity attainable while
statistically guaranteeing accuracy preservation is higher during decoding than
prefilling, and correlates with model size in the former. 3) There is no clear
strategy that performs best across tasks and phases, with different units of
sparsification or budget adaptivity needed for different scenarios. Even
moderate sparsity levels often result in significant performance degradation on
at least one task, highlighting that sparse attention is not a universal
solution. 4) We introduce and validate novel scaling laws specifically tailored
for sparse attention, providing evidence that our findings are likely to hold
true beyond our range of experiments. Through these insights, we demonstrate
that sparse attention is a key tool to enhance the capabilities of Transformer
LLMs for processing longer sequences, but requires careful evaluation of
trade-offs for performance-sensitive applications.
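To make the idea of training-free sparse attention concrete, the sketch below shows one common family evaluated in this line of work: at decode time, each new query attends only to a fixed budget of top-scoring cached keys instead of the full KV cache. This is a minimal illustration, not the paper's specific methods; the function name topk_sparse_decode_attention, the tensor shapes, and the fixed 256-key budget are illustrative assumptions.

    # Minimal sketch of decode-time top-k sparse attention (illustrative only).
    import torch

    def topk_sparse_decode_attention(q, k_cache, v_cache, budget):
        """q: (d,); k_cache, v_cache: (T, d). Attend to only `budget` cached keys."""
        scores = (k_cache @ q) / k_cache.shape[-1] ** 0.5   # (T,) scaled dot-product scores
        budget = min(budget, scores.shape[0])                # never exceed the cache length
        top_scores, top_idx = torch.topk(scores, budget)     # sparse selection of keys
        weights = torch.softmax(top_scores, dim=-1)          # renormalise over the kept keys
        return weights @ v_cache[top_idx]                    # (d,) sparse attention output

    # Toy usage: an 8K-token KV cache with a 256-key budget (~97% sparsity).
    T, d = 8192, 64
    q = torch.randn(d)
    k_cache, v_cache = torch.randn(T, d), torch.randn(T, d)
    out = topk_sparse_decode_attention(q, k_cache, v_cache, budget=256)
    print(out.shape)  # torch.Size([64])

During prefilling the analogous choice is which query-key blocks to compute at all, which is one reason the abstract treats the two phases, and their attainable sparsity levels, separately.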