SCBench: A KV Cache-Centric Analysis of Long-Context Methods

Abstract

Long-context LLMs have enabled numerous downstream applications but have also introduced significant challenges in computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered on the KV cache. However, existing benchmarks often evaluate single requests, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical because KV cache reuse has been widely adopted in LLM inference frameworks such as vLLM and SGLang, as well as by LLM providers including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on 8 long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. https://aka.ms/SCBench.
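To make the shared-context setting concrete, the following is a minimal toy sketch of KV cache reuse: the long context is encoded once and then reused across several follow-up queries. It is not SCBench code or the API of any inference framework. The names `ToyPrefixCache`, `answer`, and `run_shared_context` are hypothetical, and a list of tokens stands in for real per-token key/value states.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class ToyPrefixCache:
    """Hypothetical stand-in for a KV cache store keyed by the shared context."""
    store: dict = field(default_factory=dict)
    prefill_calls: int = 0

    def _key(self, context: str) -> str:
        # Identify a shared context by a hash of its text (an assumption of this sketch).
        return hashlib.sha256(context.encode("utf-8")).hexdigest()

    def get_or_prefill(self, context: str) -> list:
        key = self._key(context)
        if key not in self.store:
            # In a real system this is the expensive pre-filling step.
            self.prefill_calls += 1
            self.store[key] = context.split()  # placeholder for per-token KV states
        return self.store[key]


def answer(cache: ToyPrefixCache, context: str, query: str) -> bool:
    """Toy 'decode' step: report whether the queried token is in the cached context."""
    kv = cache.get_or_prefill(context)
    return query in kv


def run_shared_context(cache: ToyPrefixCache, context: str, queries: list) -> list:
    """Issue several queries against one shared context, reusing its cache."""
    return [answer(cache, context, q) for q in queries]


cache = ToyPrefixCache()
context = "alpha beta gamma delta"
results = run_shared_context(cache, context, ["beta", "omega", "delta"])
print(results, cache.prefill_calls)
```

In this sketch the context is pre-filled only once however many queries follow. A method that shrinks the cached states after the first query would answer later queries from that reduced cache, which is the lifecycle a single-request evaluation does not exercise.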
