SCBench: A KV Cache-Centric Analysis of Long-Context Methods
December 13, 2024
Authors: Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
cs.AI
Abstract
Long-context LLMs have enabled numerous downstream applications but also
introduced significant challenges related to computational and memory
efficiency. To address these challenges, optimizations for long-context
inference have been developed, centered around the KV cache. However, existing
benchmarks often evaluate single requests only, neglecting the full lifecycle of
the KV cache in real-world use. This oversight is particularly critical, as KV
cache reuse has become widely adopted in LLM inference frameworks, such as
vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft,
Google, and Anthropic. To address this gap, we introduce
SCBench (SharedContextBench), a comprehensive benchmark for evaluating
long-context methods from a KV cache-centric perspective: 1) KV cache
generation, 2) KV cache compression, 3) KV cache retrieval, 4) KV cache
loading. Specifically, SCBench uses test examples with shared context, spanning
12 tasks with two shared context modes, covering four categories of
long-context capabilities: string retrieval, semantic retrieval, global
information, and multi-task. With it, we provide an extensive KV cache-centric
analysis of eight categories of long-context solutions, including Gated Linear
RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention,
KV cache dropping, quantization, retrieval, loading, and prompt compression.
The evaluation is conducted on 8 long-context LLMs. Our findings show that
sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding
with O(n) memory and sub-O(n^2) pre-filling computation performs robustly.
Dynamic sparsity yields more expressive KV caches than static patterns, and
layer-level sparsity in hybrid architectures reduces memory usage with strong
performance. Additionally, we identify attention distribution shift issues in
long-generation scenarios. https://aka.ms/SCBench.