ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Abstract

With the widespread deployment of long-context large language models (LLMs), demand for efficient high-throughput inference is growing. However, the key-value (KV) cache grows with sequence length, and both its increasing memory footprint and the need to access it for every generated token result in low throughput when serving long-context LLMs. Various dynamic sparse attention methods have been proposed to speed up inference while maintaining generation quality, but they either fail to reduce GPU memory consumption sufficiently or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores a low-rank key cache and offloads the value cache, reducing the memory footprint to allow larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs only a minimal set of sparse KV pairs on the fly. We evaluate ShadowKV on a broad range of benchmarks, including RULER, LongBench, and Needle In A Haystack, and on models such as Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and Qwen2-7B-128K, and demonstrate that it supports up to 6× larger batch sizes and boosts throughput by up to 3.04× on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory. The code is available at https://github.com/bytedance/ShadowKV.
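
The three ideas named in the abstract (a low-rank key cache, an offloaded value cache, and on-the-fly reconstruction of a small set of selected KV pairs) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `compress_keys` and `sparse_decode_step`, the use of a truncated SVD, and the top-k scoring rule are all assumptions made for illustration; the abstract does not specify the selection strategy, and here "offloaded" simply means a separate array that is only indexed for the selected tokens.

```python
import numpy as np

def compress_keys(keys: np.ndarray, rank: int):
    """Factor keys (n, d) into a (n, rank) and b (rank, d) via truncated SVD."""
    u, s, vt = np.linalg.svd(keys, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

def sparse_decode_step(query, key_a, key_b, values_offloaded, k):
    """One decoding step that attends to only k selected tokens."""
    # Approximate scores for every token, computed from the low-rank factors
    # without materializing the full key matrix.
    approx_scores = key_a @ (key_b @ query)
    # Hypothetical selection rule (an assumption): keep the k highest scores.
    idx = np.sort(np.argpartition(approx_scores, -k)[-k:])
    # Reconstruct only the selected keys and fetch only the selected values.
    keys_sel = key_a[idx] @ key_b
    values_sel = values_offloaded[idx]
    # Standard softmax attention restricted to the selected tokens.
    logits = keys_sel @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values_sel, idx

# Toy data: 64 tokens, head dimension 16, keys of exact rank 4.
rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 16))
values_offloaded = rng.standard_normal((64, 16))  # stands in for CPU memory
query = rng.standard_normal(16)

key_a, key_b = compress_keys(keys, rank=4)
out, idx = sparse_decode_step(query, key_a, key_b, values_offloaded, k=8)
```

In this toy setting the factors hold `n * rank + rank * d` numbers instead of `n * d` for the full key matrix, and each step touches only `k` rows of the value array. Real key caches are only approximately low-rank, so the rank and `k` trade memory and latency against accuracy.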
