ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
October 28, 2024
Authors: Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
cs.AI
Abstract
With the widespread deployment of long-context large language models (LLMs),
there has been a growing demand for efficient support of high-throughput
inference. However, as the key-value (KV) cache expands with the sequence
length, the increasing memory footprint and the need to access it for each
token generation both result in low throughput when serving long-context LLMs.
While various dynamic sparse attention methods have been proposed to speed up
inference while maintaining generation quality, they either fail to
sufficiently reduce GPU memory consumption or introduce significant decoding
latency by offloading the KV cache to the CPU. We present ShadowKV, a
high-throughput long-context LLM inference system that stores the low-rank key
cache and offloads the value cache to reduce the memory footprint for larger
batch sizes and longer sequences. To minimize decoding latency, ShadowKV
employs an accurate KV selection strategy that reconstructs minimal sparse KV
pairs on-the-fly. By evaluating ShadowKV on a broad range of benchmarks,
including RULER, LongBench, and Needle In A Haystack, and models like
Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and
Qwen2-7B-128K, we demonstrate that it can support up to 6× larger batch
sizes and boost throughput by up to 3.04× on an A100 GPU without
sacrificing accuracy, even surpassing the performance achievable with infinite
batch size under the assumption of infinite GPU memory. The code is available
at https://github.com/bytedance/ShadowKV.
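To make the abstract's two ideas concrete, here is a minimal PyTorch sketch, not the official ShadowKV implementation: the key cache is kept on the GPU as a truncated-SVD factorization, the value cache is offloaded to the CPU, and each decode step reconstructs only a small set of selected KV pairs on-the-fly. The class name, `rank`, `topk`, the per-token (rather than chunk-level) selection, and the single-head shapes are all illustrative assumptions.

```python
import torch

class ShadowKVSketch:
    """Minimal sketch: low-rank key cache on GPU, offloaded value cache on CPU."""

    def __init__(self, keys: torch.Tensor, values: torch.Tensor, rank: int = 160):
        # keys, values: [seq_len, head_dim] for a single head (batch/head dims elided).
        # Keep only a rank-`rank` factorization of the keys on the GPU: K ≈ U @ Vh.
        U, S, Vh = torch.linalg.svd(keys, full_matrices=False)
        self.U = (U[:, :rank] * S[:rank]).contiguous()   # [seq_len, rank]
        self.Vh = Vh[:rank].contiguous()                 # [rank, head_dim]
        # Offload the full value cache to CPU; pinned memory enables async copies
        # (pin_memory requires a CUDA-capable setup).
        self.values_cpu = values.cpu().pin_memory()

    def decode_step(self, query: torch.Tensor, topk: int = 256) -> torch.Tensor:
        # query: [head_dim]. Score all positions cheaply through the low-rank
        # factors and keep the top-`topk` positions (a per-token stand-in for
        # ShadowKV's KV selection strategy).
        scores = (query @ self.Vh.T) @ self.U.T          # [seq_len], approximate q·k
        idx = scores.topk(topk).indices
        # Reconstruct only the selected keys from the low-rank factors, and
        # fetch only the selected values from CPU memory.
        sparse_keys = self.U[idx] @ self.Vh              # [topk, head_dim]
        sparse_vals = self.values_cpu[idx.cpu()].to(query.device, non_blocking=True)
        # Sparse attention over the reconstructed minimal KV pairs.
        attn = torch.softmax(query @ sparse_keys.T / sparse_keys.shape[-1] ** 0.5, dim=-1)
        return attn @ sparse_vals
```

The memory saving comes from the factorization: the GPU holds seq_len × rank plus rank × head_dim floats per head instead of the full seq_len × head_dim key cache, while values live in CPU memory and only `topk` rows cross the PCIe bus per decode step. The paper's full system differs in the details of how KV pairs are selected; this sketch only conveys the reconstruct-and-fetch pattern.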