ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
October 28, 2024
Authors: Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
cs.AI
Abstract
With the widespread deployment of long-context large language models (LLMs),
there has been a growing demand for efficient support of high-throughput
inference. However, as the key-value (KV) cache expands with the sequence
length, the increasing memory footprint and the need to access it for each
token generation both result in low throughput when serving long-context LLMs.
While various dynamic sparse attention methods have been proposed to speed up
inference while maintaining generation quality, they either fail to
sufficiently reduce GPU memory consumption or introduce significant decoding
latency by offloading the KV cache to the CPU. We present ShadowKV, a
high-throughput long-context LLM inference system that stores the low-rank key
cache and offloads the value cache to reduce the memory footprint for larger
batch sizes and longer sequences. To minimize decoding latency, ShadowKV
employs an accurate KV selection strategy that reconstructs minimal sparse KV
pairs on-the-fly. By evaluating ShadowKV on a broad range of benchmarks,
including RULER, LongBench, and Needle In A Haystack, and models like
Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and
Qwen2-7B-128K, we demonstrate that it can support up to 6× larger batch
sizes and boost throughput by up to 3.04× on an A100 GPU without
sacrificing accuracy, even surpassing the performance achievable with infinite
batch size under the assumption of infinite GPU memory. The code is available
at https://github.com/bytedance/ShadowKV.
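To make the mechanism the abstract describes more concrete, below is a minimal sketch of the two core ideas: keeping a low-rank factorization of the key cache on GPU, and reconstructing only a small, dynamically selected set of KV pairs at each decoding step, fetching the corresponding values from CPU. This is an illustrative sketch, not the authors' implementation: the function names (compress_keys, decode_step), the rank and top-k defaults, the single-head setting, and the omission of RoPE, chunk-level landmark selection, and batching are all simplifying assumptions; the real system is in the repository above.

```python
import torch

def compress_keys(K: torch.Tensor, rank: int = 32):
    """Factorize the key cache K (seq_len, d) into low-rank factors A @ B.

    Only A (seq_len, rank) and B (rank, d) are kept on GPU, replacing the
    full key cache. Shapes and the default rank are illustrative.
    """
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (seq_len, rank)
    B = Vh[:rank]                # (rank, d)
    return A, B

def decode_step(q: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                V_cpu: torch.Tensor, topk: int = 256) -> torch.Tensor:
    """One decoding step: score positions with the low-rank keys, then
    reconstruct only the top-k KV pairs on the fly.

    ShadowKV scores coarse chunk-level landmarks rather than every
    position; scoring all positions here just keeps the sketch short.
    """
    # Low-rank form lets us score without materializing the full key cache.
    scores = A @ (B @ q)                      # (seq_len,) approximate logits
    idx = scores.topk(topk).indices           # positions worth attending to
    K_sel = A[idx] @ B                        # rebuild selected keys on GPU
    V_sel = V_cpu[idx.cpu()].to(q.device)     # fetch selected values from CPU
    attn = torch.softmax(K_sel @ q / K_sel.shape[-1] ** 0.5, dim=0)
    return attn @ V_sel                       # (d,) attention output
```

The payoff of this layout is that GPU memory holds only the small factors A and B plus a transient top-k working set, while the bulky value cache lives on CPU and is touched only for the selected positions, which is what enables the larger batch sizes the abstract reports.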