SCBench: A KV Cache-Centric Analysis of Long-Context Methods
December 13, 2024
Authors: Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
cs.AI
Abstract
Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate only single requests, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks such as vLLM and SGLang, as well as by LLM providers including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on 8 long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage while maintaining strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. https://aka.ms/SCBench.
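
To make the multi-request KV cache lifecycle concrete, below is a minimal Python sketch; it is illustrative only, and the names prefill, answer, and prefix_store are our own placeholders rather than APIs from SCBench or any specific inference framework. It shows the shared-context pattern the benchmark targets: a long context is prefilled once into a KV cache, and later queries reuse that cache instead of recomputing the quadratic prefill.

    from dataclasses import dataclass, field

    @dataclass
    class KVCache:
        # Stand-in for the per-layer key/value tensors a real model would store.
        tokens: list = field(default_factory=list)

    # Cache keyed by the shared context (real systems key on token-prefix hashes).
    prefix_store: dict[str, KVCache] = {}

    def prefill(context: str) -> KVCache:
        # Stage 1: KV cache generation -- the expensive attention pass over the
        # long context, done once per shared context when reuse is enabled.
        return KVCache(tokens=context.split())

    def answer(context: str, query: str) -> str:
        cache = prefix_store.get(context)
        if cache is None:
            cache = prefill(context)
            prefix_store[context] = cache  # stages 2-4 (compression, retrieval,
                                           # loading) operate on this stored cache
        # Decoding processes only the new query tokens against the cached context.
        return f"answer to {query!r} using {len(cache.tokens)} cached context tokens"

    shared_doc = "a very long shared document " * 1000
    for q in ["Who wrote section 2?", "Summarize the findings.", "List all dates."]:
        print(answer(shared_doc, q))

In these terms, a single-request benchmark corresponds to calling answer once on a fresh context, whereas SCBench's shared-context modes stress the reuse path, where KV entries that were compressed, dropped, or quantized in an earlier turn must still serve later turns.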