ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
October 28, 2024
Authors: Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
cs.AI
Abstract
With the widespread deployment of long-context large language models (LLMs),
there has been a growing demand for efficient support of high-throughput
inference. However, as the key-value (KV) cache expands with the sequence
length, the increasing memory footprint and the need to access it for each
token generation both result in low throughput when serving long-context LLMs.
While various dynamic sparse attention methods have been proposed to speed up
inference while maintaining generation quality, they either fail to
sufficiently reduce GPU memory consumption or introduce significant decoding
latency by offloading the KV cache to the CPU. We present ShadowKV, a
high-throughput long-context LLM inference system that stores the low-rank key
cache and offloads the value cache to reduce the memory footprint for larger
batch sizes and longer sequences. To minimize decoding latency, ShadowKV
employs an accurate KV selection strategy that reconstructs minimal sparse KV
pairs on-the-fly. By evaluating ShadowKV on a broad range of benchmarks,
including RULER, LongBench, and Needle In A Haystack, and models like
Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and
Qwen2-7B-128K, we demonstrate that it can support up to 6× larger batch
sizes and boost throughput by up to 3.04× on an A100 GPU without
sacrificing accuracy, even surpassing the performance achievable with infinite
batch size under the assumption of infinite GPU memory. The code is available
at https://github.com/bytedance/ShadowKV.
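To make the mechanism the abstract describes more concrete, below is a minimal sketch of the two core ideas: keeping a low-rank factorization of the key cache on GPU, and reconstructing only a small, dynamically selected set of KV pairs at each decoding step, fetching the corresponding values from CPU. This is an illustrative sketch, not the authors' implementation: the function names (compress_keys, decode_step), the rank and top-k defaults, the single-head setting, and the omission of RoPE, chunk-level landmark selection, and batching are all simplifying assumptions; the real system is in the repository above.

```python
import torch

def compress_keys(K: torch.Tensor, rank: int = 32):
    """Factorize the key cache K (seq_len, d) into low-rank factors A @ B.

    Only A (seq_len, rank) and B (rank, d) are kept on GPU, replacing the
    full key cache. Shapes and the default rank are illustrative.
    """
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (seq_len, rank)
    B = Vh[:rank]                # (rank, d)
    return A, B

def decode_step(q: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                V_cpu: torch.Tensor, topk: int = 256) -> torch.Tensor:
    """One decoding step: score positions with the low-rank keys, then
    reconstruct only the top-k KV pairs on the fly.

    ShadowKV scores coarse chunk-level landmarks rather than every
    position; scoring all positions here just keeps the sketch short.
    """
    # Low-rank form lets us score without materializing the full key cache.
    scores = A @ (B @ q)                      # (seq_len,) approximate logits
    idx = scores.topk(topk).indices           # positions worth attending to
    K_sel = A[idx] @ B                        # rebuild selected keys on GPU
    V_sel = V_cpu[idx.cpu()].to(q.device)     # fetch selected values from CPU
    attn = torch.softmax(K_sel @ q / K_sel.shape[-1] ** 0.5, dim=0)
    return attn @ V_sel                       # (d,) attention output
```

The payoff of this layout is that GPU memory holds only the small factors A and B plus a transient top-k working set, while the bulky value cache lives on CPU and is touched only for the selected positions, which is what enables the larger batch sizes the abstract reports.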