FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

Abstract

While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in improving latency. To address this issue, we introduce FastKV, a KV cache compression method designed to improve latency for long-context sequences. To increase processing speed while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers, even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00× and 1.40× improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at https://github.com/dongwonjo/FastKV.
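
As a rough illustration of the idea behind token-selective propagation, the toy sketch below keeps every token in the early layers, scores each token, and passes only the highest-scoring tokens on to deeper layers. The abstract does not say how tokens are scored or selected, so the scoring rule (per-token importance summed over the query heads that share a KV head, in the spirit of GQA) and every function name here are assumptions made for illustration, not the FastKV implementation.

```python
# Toy sketch of token-selective propagation. All names are hypothetical and
# the scoring rule is an assumption, not the FastKV implementation.

def group_scores(head_scores, group_size):
    """Sum per-token scores over the query heads that share one KV head."""
    groups = []
    for start in range(0, len(head_scores), group_size):
        block = head_scores[start:start + group_size]
        groups.append([sum(col) for col in zip(*block)])
    return groups

def select_tokens(scores, keep):
    """Return, in original order, the indices of the `keep` top-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

def propagate(hidden_states, scores, keep):
    """Keep only the selected tokens' hidden states for the deeper layers."""
    kept = select_tokens(scores, keep)
    return [hidden_states[i] for i in kept], kept

# Four query heads, two per KV group, five prompt tokens (made-up scores).
head_scores = [
    [0.1, 0.5, 0.1, 0.2, 0.1],
    [0.2, 0.4, 0.1, 0.2, 0.1],
    [0.3, 0.1, 0.1, 0.1, 0.4],
    [0.2, 0.1, 0.2, 0.1, 0.4],
]
per_group = group_scores(head_scores, group_size=2)
token_scores = [sum(col) for col in zip(*per_group)]
hidden = ["h0", "h1", "h2", "h3", "h4"]  # stand-ins for hidden-state vectors
kept_states, kept_idx = propagate(hidden, token_scores, keep=3)
print(kept_idx)  # [0, 1, 4]
```

In this sketch the deeper layers would see three of the five tokens, which is where a prefill-time speedup could come from; how many tokens to keep and at which layer to start dropping them are left open here.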
