xKV: Cross-Layer SVD for KV-Cache Compression

March 24, 2025
作者: Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
cs.AI

Abstract

Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies have attempted to merge the KV-Cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on the assumption of high per-token cosine similarity across layers, which generally does not hold in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) to the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than state-of-the-art inter-layer techniques while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rate on coding tasks without performance degradation. These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
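
The core mechanism described in the abstract, grouping several layers and factoring their concatenated KV-Cache with a truncated SVD into one shared low-rank subspace, can be sketched in a few lines of NumPy. The sketch below is an illustration of that idea only, not the authors' implementation: the function names (`xkv_compress`, `xkv_decompress`), the choice of rank, and the toy shapes are all assumptions made for exposition.

```python
import numpy as np

def xkv_compress(kv_group, rank):
    """Sketch: compress the KV-Cache of a group of layers with one shared SVD.

    kv_group : list of [num_tokens, head_dim] arrays, one per layer in the group
    rank     : target rank of the shared subspace (illustrative choice)
    """
    # Stack the per-layer caches along the feature axis: [T, L * d]
    stacked = np.concatenate(kv_group, axis=1)

    # Truncated SVD. Per the paper's observation, the dominant singular
    # vectors are well aligned across layers, so a low rank suffices.
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    U_r, S_r, Vt_r = U[:, :rank], S[:rank], Vt[:rank, :]

    # Keep one shared token-side factor U_r ([T, r]) plus a small
    # feature-side factor ([L * d, r]) instead of L full caches.
    return U_r, Vt_r.T * S_r

def xkv_decompress(U_r, W_r, dims):
    """Reconstruct approximate per-layer KV caches from the shared subspace."""
    approx = U_r @ W_r.T  # [T, L * d] = U_r @ diag(S_r) @ Vt_r
    return np.split(approx, np.cumsum(dims)[:-1], axis=1)

# Toy usage: a group of 3 layers, 1024 tokens, head_dim 128, rank 64
rng = np.random.default_rng(0)
group = [rng.standard_normal((1024, 128)) for _ in range(3)]
U_r, W_r = xkv_compress(group, rank=64)
recon = xkv_decompress(U_r, W_r, dims=[128, 128, 128])
```

Under these toy shapes, storage drops from 3 x 1024 x 128 floats to 1024 x 64 + 384 x 64 floats (roughly a 4.4x reduction), with reconstruction error governed by the discarded singular values.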
