LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
February 20, 2025
Authors: Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
cs.AI
Abstract
Large language models (LLMs) have shown remarkable potential in processing
long sequences, yet efficiently serving these long-context models remains
challenging due to the quadratic computational complexity of attention in the
prefilling stage and the large memory footprint of the KV cache in the decoding
stage. To address these issues, we introduce LServe, an efficient system that
accelerates long-sequence LLM serving via hybrid sparse attention. This method
unifies different hardware-friendly, structured sparsity patterns for both
prefilling and decoding attention into a single framework, where computations
on less important tokens are skipped block-wise. LServe demonstrates the
compatibility of static and dynamic sparsity in long-context LLM attention.
This design enables multiplicative speedups by combining these optimizations.
Specifically, we convert half of the attention heads to nearly free streaming
heads in both the prefilling and decoding stages. Additionally, we find that
only a constant number of KV pages is required to preserve long-context
capabilities, irrespective of context length. We then design a hierarchical KV
page selection policy that dynamically prunes KV pages based on query-centric
similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and
decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is
released at https://github.com/mit-han-lab/omniserve.
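The abstract describes converting half of the attention heads into "streaming" heads that attend only to a few initial sink tokens plus a recent local window, so their per-token cost stays roughly constant regardless of context length. The sketch below illustrates that attention pattern in isolation; it is a minimal sketch, not the LServe implementation, and the function name, sink/window sizes, and tensor shapes are assumptions.

```python
# Illustrative sketch of a "streaming" attention head (sink tokens + local window).
# Not the authors' code; num_sink, window, and shapes are hypothetical.
import torch

def streaming_head_attention(q, k, v, num_sink=4, window=256):
    """q: [1, d] query for the current decoding step; k, v: [T, d] cached keys/values."""
    T, d = k.shape
    # Keep only the initial sink tokens and the most recent `window` tokens.
    keep = torch.cat([torch.arange(min(num_sink, T)),
                      torch.arange(max(num_sink, T - window), T)])
    k_s, v_s = k[keep], v[keep]
    scores = (q @ k_s.T) / d ** 0.5          # [1, |keep|]
    probs = torch.softmax(scores, dim=-1)
    return probs @ v_s                       # [1, d]
```

Because the set of attended positions is capped at `num_sink + window` tokens, such a head's cost does not grow with the sequence length, which is what makes it "nearly free" relative to a dense head.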
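The abstract also mentions a hierarchical KV page selection policy that dynamically prunes pages based on query-centric similarity. One common way to realize query-centric page scoring is to keep per-channel min/max key statistics for each page and use them as an upper bound on the query-key dot product; the sketch below uses that rule purely as an illustration. The function `select_pages`, the page-statistics layout, and the retained-page budget are assumptions, not the paper's exact policy.

```python
# Illustrative sketch of query-centric KV page selection (not the authors' code).
# Each page stores channel-wise min/max of its keys; a page's score is an upper
# bound on q . k over that page, and only the top-scoring pages are attended to.
import torch

def select_pages(q, key_min, key_max, num_pages_kept=16):
    """q: [d] current query; key_min, key_max: [P, d] per-page channel-wise key bounds."""
    # For each channel, pick whichever bound maximizes the product with q.
    bound = torch.where(q >= 0, key_max, key_min)   # [P, d]
    scores = (bound * q).sum(dim=-1)                # [P] upper bounds on q . k per page
    k = min(num_pages_kept, scores.shape[0])
    return torch.topk(scores, k).indices            # indices of retained KV pages
```

The abstract reports that a constant number of KV pages suffices to preserve long-context ability irrespective of context length, which is why the retained-page budget (`num_pages_kept` here) is a fixed constant rather than a fraction of the sequence.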