

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

February 16, 2025
作者: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
cs.AI

Abstract

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
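The abstract's dynamic hierarchical sparse strategy pairs coarse-grained token compression (scoring compressed blocks of keys) with fine-grained token selection (attending only over tokens in the highest-scoring blocks). The sketch below illustrates that two-stage idea for a single query in NumPy; the block-mean compression, the `block_size` and `top_k` parameters, and the function name are illustrative assumptions, not the paper's kernel, which is hardware-aligned and trained end to end.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_sparse_attention(q, K, V, block_size=4, top_k=2):
    """Toy two-stage sparse attention for one query vector q.

    Stage 1 (coarse): compress each contiguous key block to its mean
    and score the compressed blocks against the query.
    Stage 2 (fine): gather tokens from the top-k scoring blocks and
    run standard softmax attention over that subset only.
    Assumes len(K) is divisible by block_size.
    """
    n, d = K.shape
    n_blocks = n // block_size
    # Coarse stage: block-mean compression of the keys.
    K_comp = K.reshape(n_blocks, block_size, d).mean(axis=1)   # (n_blocks, d)
    block_scores = K_comp @ q / np.sqrt(d)                      # (n_blocks,)
    # Fine stage: keep only tokens from the top-k blocks.
    chosen = np.argsort(block_scores)[-top_k:]
    idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in chosen]
    )
    # Standard attention restricted to the selected tokens.
    w = softmax(K[idx] @ q / np.sqrt(d))
    return w @ V[idx], np.sort(idx)
```

With `n = 16`, `block_size = 4`, `top_k = 2`, only 8 of 16 key/value rows enter the softmax, which is the source of the speedup the abstract reports at 64k-length sequences (where the selected fraction is far smaller).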
