SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

December 18, 2024
Authors: Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou
cs.AI

Abstract

The Key-Value (KV) cache has become a bottleneck for LLMs in long-context generation. Despite the numerous efforts in this area, optimization of the decoding phase is generally overlooked. However, we believe such optimization is crucial, especially for long-output generation tasks, based on the following two observations: (i) excessive compression during the prefill phase, on tasks that require the full context, impairs comprehension in reasoning tasks; (ii) the set of heavy hitters deviates over the course of reasoning tasks with long outputs. Therefore, we introduce SCOPE, a simple yet efficient framework that performs KV cache optimization separately during the prefill and decoding phases. Specifically, the KV cache from the prefill phase is preserved to maintain the essential information, while a novel sliding-based strategy is proposed to select essential heavy hitters for the decoding phase. Memory usage and memory transfer are further optimized using adaptive and discontinuous strategies. Extensive experiments on LongGenBench show the effectiveness and generalization of SCOPE, as well as its compatibility as a plug-in to other prefill-only KV compression methods.
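The abstract gives no code, so below is a minimal PyTorch sketch of the kind of decoding-phase selection it describes: the prefill cache is kept intact, a local window of recent decoded tokens is always retained, and older decoded tokens compete for a heavy-hitter budget scored by accumulated attention. The function name `compress_decoding_cache`, the `attn_scores` proxy, and the exact keep rule are illustrative assumptions, not the paper's implementation.

```python
import torch

def compress_decoding_cache(keys, values, attn_scores, prefill_len,
                            budget, window):
    """Illustrative sliding heavy-hitter selection for the decoding phase.

    keys/values: (seq_len, head_dim) cached tensors for one attention head.
    attn_scores: (seq_len,) accumulated attention each cached token received
                 (an assumed proxy for heavy-hitter importance).
    prefill_len: number of prefill tokens; kept uncompressed, since SCOPE
                 preserves the prefill-phase KV cache.
    budget:      number of older decoded tokens retained as heavy hitters.
    window:      number of most recent decoded tokens always kept.
    """
    seq_len = keys.size(0)
    # 1. Keep the entire prefill cache (essential context is preserved).
    keep = list(range(prefill_len))
    # 2. Always keep the most recent `window` decoded tokens.
    recent_start = max(prefill_len, seq_len - window)
    # 3. Among the older decoded tokens, keep only the `budget` heaviest
    #    hitters, ranked by their accumulated attention scores.
    middle = torch.arange(prefill_len, recent_start)
    if middle.numel() > budget:
        top = attn_scores[middle].topk(budget).indices
        middle = middle[top].sort().values
    keep += middle.tolist() + list(range(recent_start, seq_len))
    idx = torch.tensor(keep, dtype=torch.long)
    return keys[idx], values[idx]
```

Note the contrast with prefill-only compression: here nothing before `prefill_len` is evicted, and the heavy-hitter budget is spent only on decoded tokens, matching the abstract's claim that the two phases should be optimized separately.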
