

Star Attention: Efficient LLM Inference over Long Sequences

November 26, 2024
Authors: Shantanu Acharya, Fei Jia, Boris Ginsburg
cs.AI

Abstract

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
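The two phases described in the abstract can be illustrated with a minimal, single-head NumPy sketch. This is not the authors' implementation: random matrices stand in for a model's query/key/value projections, causal masking and the actual multi-host sharding are omitted, and the function names (`attend`, `phase1_context_encoding`, `phase2_query_decoding`) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Standard scaled dot-product attention (single head, no masking)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def phase1_context_encoding(ctx_q, ctx_k, ctx_v, block_size):
    """Phase 1: blockwise-local attention over the context.

    Each block attends only to tokens within its own block, so blocks can
    be processed independently (e.g., one block per host). The per-block
    key/value pairs are cached for phase 2.
    """
    n = ctx_q.shape[0]
    outputs, kv_cache = [], []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        q_blk, k_blk, v_blk = ctx_q[start:end], ctx_k[start:end], ctx_v[start:end]
        outputs.append(attend(q_blk, k_blk, v_blk))  # local attention only
        kv_cache.append((k_blk, v_blk))              # cached for phase 2
    return np.concatenate(outputs, axis=0), kv_cache

def phase2_query_decoding(query_q, kv_cache):
    """Phase 2: query/response tokens attend to all previously cached
    tokens, i.e., sequence-global attention over the distributed KV cache."""
    k_all = np.concatenate([k for k, _ in kv_cache], axis=0)
    v_all = np.concatenate([v for _, v in kv_cache], axis=0)
    return attend(query_q, k_all, v_all)

# Toy usage: random tensors stand in for a real model's projections.
rng = np.random.default_rng(0)
d, n_ctx, n_query, block = 16, 64, 4, 16
ctx_q, ctx_k, ctx_v = (rng.standard_normal((n_ctx, d)) for _ in range(3))
query_q = rng.standard_normal((n_query, d))

_, cache = phase1_context_encoding(ctx_q, ctx_k, ctx_v, block)
out = phase2_query_decoding(query_q, cache)
print(out.shape)  # (4, 16)
```

Because no context block attends across a block boundary in phase 1, each block's KV cache can be built independently and in parallel; phase 2 then lets the query attend to the union of those caches, which is where the global-attention behavior is recovered.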

