LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
February 20, 2025
Authors: Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
cs.AI
Abstract
Large language models (LLMs) have shown remarkable potential in processing
long sequences, yet efficiently serving these long-context models remains
challenging due to the quadratic computational complexity of attention in the
prefilling stage and the large memory footprint of the KV cache in the decoding
stage. To address these issues, we introduce LServe, an efficient system that
accelerates long-sequence LLM serving via hybrid sparse attention. This method
unifies different hardware-friendly, structured sparsity patterns for both
prefilling and decoding attention into a single framework, where computations
on less important tokens are skipped block-wise. LServe demonstrates the
compatibility of static and dynamic sparsity in long-context LLM attention.
This design enables multiplicative speedups by combining these optimizations.
Specifically, we convert half of the attention heads to nearly free streaming
heads in both the prefilling and decoding stages. Additionally, we find that
only a constant number of KV pages is required to preserve long-context
capabilities, irrespective of context length. We then design a hierarchical KV
page selection policy that dynamically prunes KV pages based on query-centric
similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and
decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is
released at https://github.com/mit-han-lab/omniserve.
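The abstract describes converting half of the attention heads into "streaming" heads that attend only to a few initial sink tokens plus a recent local window, so their per-token cost stays roughly constant regardless of context length. The sketch below illustrates that attention pattern in isolation; it is a minimal sketch, not the LServe implementation, and the function name, sink/window sizes, and tensor shapes are assumptions.

```python
# Illustrative sketch of a "streaming" attention head (sink tokens + local window).
# Not the authors' code; num_sink, window, and shapes are hypothetical.
import torch

def streaming_head_attention(q, k, v, num_sink=4, window=256):
    """q: [1, d] query for the current decoding step; k, v: [T, d] cached keys/values."""
    T, d = k.shape
    # Keep only the initial sink tokens and the most recent `window` tokens.
    keep = torch.cat([torch.arange(min(num_sink, T)),
                      torch.arange(max(num_sink, T - window), T)])
    k_s, v_s = k[keep], v[keep]
    scores = (q @ k_s.T) / d ** 0.5          # [1, |keep|]
    probs = torch.softmax(scores, dim=-1)
    return probs @ v_s                       # [1, d]
```

Because the set of attended positions is capped at `num_sink + window` tokens, such a head's cost does not grow with the sequence length, which is what makes it "nearly free" relative to a dense head.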
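The abstract also mentions a hierarchical KV page selection policy that dynamically prunes pages based on query-centric similarity. One common way to realize query-centric page scoring is to keep per-channel min/max key statistics for each page and use them as an upper bound on the query-key dot product; the sketch below uses that rule purely as an illustration. The function `select_pages`, the page-statistics layout, and the retained-page budget are assumptions, not the paper's exact policy.

```python
# Illustrative sketch of query-centric KV page selection (not the authors' code).
# Each page stores channel-wise min/max of its keys; a page's score is an upper
# bound on q . k over that page, and only the top-scoring pages are attended to.
import torch

def select_pages(q, key_min, key_max, num_pages_kept=16):
    """q: [d] current query; key_min, key_max: [P, d] per-page channel-wise key bounds."""
    # For each channel, pick whichever bound maximizes the product with q.
    bound = torch.where(q >= 0, key_max, key_min)   # [P, d]
    scores = (bound * q).sum(dim=-1)                # [P] upper bounds on q . k per page
    k = min(num_pages_kept, scores.shape[0])
    return torch.topk(scores, k).indices            # indices of retained KV pages
```

The abstract reports that a constant number of KV pages suffices to preserve long-context ability irrespective of context length, which is why the retained-page budget (`num_pages_kept` here) is a fixed constant rather than a fraction of the sequence.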