SCBench: A KV Cache-Centric Analysis of Long-Context Methods

Abstract

Long-context LLMs have enabled numerous downstream applications but have also introduced significant challenges in computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate single requests only, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical because KV cache reuse has become widely adopted in LLM inference frameworks, such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on 8 long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. https://aka.ms/SCBench.
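The finding that sub-O(n) memory methods suffer in multi-turn scenarios can be illustrated with a toy sketch. This is not SCBench code: the key-value "context", the `compress_cache` helper, and the budget of 4 entries are all hypothetical and chosen only for illustration. It models a cache compressed around the first query and then reused for follow-up queries over the same shared context.

```python
# Toy illustration only: every name here is hypothetical, not part of SCBench.

def build_cache(context):
    """Full cache: keep every (key, value) pair, i.e. O(n) memory."""
    return dict(context)

def compress_cache(cache, first_query, budget):
    """Query-aware dropping: keep the entry the first query needs, then
    fill the remaining budget in insertion order (sub-O(n) memory)."""
    kept = {}
    if first_query in cache:
        kept[first_query] = cache[first_query]
    for key, value in cache.items():
        if len(kept) >= budget:
            break
        kept.setdefault(key, value)
    return kept

def run_shared_context(cache, queries, expected):
    """Answer several queries against one shared cache; return accuracy."""
    hits = sum(cache.get(q) == e for q, e in zip(queries, expected))
    return hits / len(queries)

# One shared context, three follow-up queries that reuse the same cache.
context = [(f"key{i}", f"value{i}") for i in range(100)]
queries = ["key42", "key7", "key93"]
expected = ["value42", "value7", "value93"]

full = build_cache(context)
small = compress_cache(full, first_query=queries[0], budget=4)

full_acc = run_shared_context(full, queries, expected)
small_acc = run_shared_context(small, queries, expected)
print(f"full cache: {full_acc:.2f}, compressed cache: {small_acc:.2f}")
```

In this sketch the compressed cache answers the first query correctly but misses the later ones, because the entries they need were dropped before those queries were known. A single-request evaluation would only ever see the first query, which is the gap the abstract describes.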
