

Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

March 4, 2025
Authors: Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, Éric de la Clergerie, Benoît Sagot
cs.AI

Abstract

Autoregressive language models rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making inference faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Unlike many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV on retrieval tasks, while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves 99% accuracy on the needle-in-a-haystack task at a 32× compression level, while reducing the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.
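To make the idea concrete, here is a minimal PyTorch sketch of the mechanism the abstract describes: a single context-agnostic direction is estimated from query vectors offline, and cached keys are then scored by their projection onto that direction, so key-value pairs can be pruned without ever materializing an attention map. The function names (`estimate_qfilter`, `compress_kv`), the SVD-based estimation, the sign-fixing heuristic, and the per-head shapes are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def estimate_qfilter(queries: torch.Tensor) -> torch.Tensor:
    """Estimate a context-agnostic filter direction for one attention head.

    `queries`: (num_samples, head_dim) query vectors gathered offline,
    e.g. on a small calibration corpus (assumed setup for this sketch).
    """
    # Principal right-singular vector of the stacked queries.
    _, _, Vh = torch.linalg.svd(queries, full_matrices=False)
    v = Vh[0]
    # Fix the sign so the direction points "with" the average query;
    # keys aligned with typical queries then receive high scores.
    if (queries @ v).mean() < 0:
        v = -v
    return v  # shape: (head_dim,)

def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                qfilter: torch.Tensor, keep: int):
    """Retain the `keep` KV pairs whose keys project most strongly onto qfilter.

    keys, values: (seq_len, head_dim). The projection acts as a cheap proxy
    for attention relevance; no attention map is computed.
    """
    scores = keys @ qfilter                    # (seq_len,)
    idx = torch.topk(scores, k=keep).indices   # positions to retain
    idx, _ = torch.sort(idx)                   # preserve temporal order
    return keys[idx], values[idx]

# Toy usage: one head, 1024 cached positions, head_dim 64, 32x compression.
torch.manual_seed(0)
Q_calib = torch.randn(4096, 64)                # hypothetical calibration queries
K, V = torch.randn(1024, 64), torch.randn(1024, 64)
f = estimate_qfilter(Q_calib)
K_small, V_small = compress_kv(K, V, f, keep=1024 // 32)
```

Because the scores depend only on the keys and a fixed vector, the selection can run entirely outside the attention kernel, which is consistent with the abstract's claim of compatibility with fused implementations such as FlashAttention.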
