Star Attention: Efficient LLM Inference over Long Sequences

November 26, 2024
Authors: Shantanu Acharya, Fei Jia, Boris Ginsburg
cs.AI

Abstract

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
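To make the two-phase description above concrete, here is a minimal single-head NumPy sketch of the idea: blockwise-local attention over the context in phase 1 (run in parallel across hosts in the paper), then sequence-global attention from the query tokens over all cached keys/values in phase 2. The block size, weight matrices, and the omission of causal masking among query tokens are illustrative simplifications, not the paper's implementation.

```python
# Hypothetical sketch of the two-phase attention described in the abstract.
# Single head, NumPy only; not the authors' implementation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phase1_blockwise_local(context, Wq, Wk, Wv, block_size):
    """Phase 1: each context block attends only within itself
    (per-host, in parallel in the paper); K/V are cached for phase 2."""
    d = Wq.shape[1]
    outputs, k_cache, v_cache = [], [], []
    for start in range(0, len(context), block_size):
        block = context[start:start + block_size]        # (b, d_model)
        q, k, v = block @ Wq, block @ Wk, block @ Wv     # local projections
        scores = softmax(q @ k.T / np.sqrt(d))           # block-local attention
        outputs.append(scores @ v)
        k_cache.append(k)
        v_cache.append(v)
    return (np.concatenate(outputs),
            np.concatenate(k_cache),
            np.concatenate(v_cache))

def phase2_global_query(query_tokens, k_cache, v_cache, Wq, Wk, Wv):
    """Phase 2: query/response tokens attend to all previously cached
    context tokens plus themselves (sequence-global attention).
    Causal masking among the query tokens is omitted for brevity."""
    d = Wq.shape[1]
    q = query_tokens @ Wq
    k = np.concatenate([k_cache, query_tokens @ Wk])
    v = np.concatenate([v_cache, query_tokens @ Wv])
    scores = softmax(q @ k.T / np.sqrt(d))
    return scores @ v

# Toy usage: 16 context tokens split into blocks of 4, then 2 query tokens.
rng = np.random.default_rng(0)
d_model = 8
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context = rng.normal(size=(16, d_model))
query = rng.normal(size=(2, d_model))
_, k_cache, v_cache = phase1_blockwise_local(context, Wq, Wk, Wv, block_size=4)
out = phase2_global_query(query, k_cache, v_cache, Wq, Wk, Wv)
print(out.shape)  # (2, d_model)
```

Because phase 1 never computes attention across blocks, its cost grows linearly in the number of blocks rather than quadratically in the full context length, which is where the efficiency claim in the abstract comes from.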
