ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

February 1, 2025
Authors: Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Bo Li, Xuming Hu, Xiaowen Chu
cs.AI

Abstract

To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that previous KV cache compression methods measure token importance individually, neglecting the dependencies between tokens that characterize real-world language. In light of this, we introduce ChunkKV, which groups the tokens in a chunk as a basic compression unit, retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluated ChunkKV on cutting-edge long-context benchmarks including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmarks. Our experiments with instruction-tuned and multi-step reasoning (O1 and R1) LLMs show up to a 10% performance improvement under aggressive compression ratios compared to existing methods.
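The abstract describes two ideas: scoring and keeping whole chunks of tokens rather than individual tokens, and reusing the kept indices across layers. The minimal PyTorch sketch below illustrates that general structure only; the per-token importance scores, the `chunk_size`, `keep_ratio`, and `reuse_every` values, and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def chunk_kv_compress(keys, values, token_scores, chunk_size=8, keep_ratio=0.3):
    """Sketch of chunk-level KV cache compression.

    keys, values: [num_heads, seq_len, head_dim]
    token_scores: [seq_len] per-token importance (e.g. attention mass from a
                  recent observation window); the paper's exact scoring may
                  differ -- this is a stand-in.
    """
    num_heads, seq_len, head_dim = keys.shape
    num_chunks = (seq_len + chunk_size - 1) // chunk_size

    # Pad per-token scores to a multiple of chunk_size and score each chunk
    # by the total importance of the tokens it contains.
    padded = torch.zeros(num_chunks * chunk_size)
    padded[:seq_len] = token_scores
    chunk_scores = padded.view(num_chunks, chunk_size).sum(dim=-1)

    # Keep the highest-scoring chunks; every token in a kept chunk survives,
    # preserving local semantic units instead of isolated tokens.
    num_keep = max(1, int(num_chunks * keep_ratio))
    kept_chunks = torch.topk(chunk_scores, num_keep).indices.sort().values

    token_idx = (kept_chunks[:, None] * chunk_size
                 + torch.arange(chunk_size)).flatten()
    token_idx = token_idx[token_idx < seq_len]  # drop padding positions

    return keys[:, token_idx], values[:, token_idx], token_idx


def compress_all_layers(kv_per_layer, scores_per_layer, reuse_every=2):
    """Layer-wise index reuse (sketch): recompute kept indices only every
    `reuse_every` layers and reuse them in between, exploiting the observed
    cross-layer similarity of the preserved indices."""
    compressed, cached_idx = [], None
    for layer, ((k, v), s) in enumerate(zip(kv_per_layer, scores_per_layer)):
        if cached_idx is None or layer % reuse_every == 0:
            k_c, v_c, cached_idx = chunk_kv_compress(k, v, s)
        else:
            k_c, v_c = k[:, cached_idx], v[:, cached_idx]
        compressed.append((k_c, v_c))
    return compressed
```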
