SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
December 18, 2024
Authors: Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou
cs.AI
Abstract
Key-Value (KV) cache has become a bottleneck of LLMs for long-context
generation. Despite the numerous efforts in this area, the optimization for the
decoding phase is generally ignored. However, we believe such optimization is
crucial, especially for long-output generation tasks based on the following two
observations: (i) excessive compression during the prefill phase, which
requires the specific full context, impairs comprehension in reasoning tasks;
(ii) the heavy hitters deviate in reasoning tasks with long
outputs. Therefore, SCOPE, a simple yet efficient framework that separately
performs KV cache optimization during the prefill and decoding phases, is
introduced. Specifically, the KV cache during the prefill phase is preserved to
maintain the essential information, while a novel strategy based on sliding is
proposed to select essential heavy hitters for the decoding phase. Memory usage
and memory transfer are further optimized using adaptive and discontinuous
strategies. Extensive experiments on LongGenBench show the effectiveness and
generalization of SCOPE and its compatibility as a plug-in to other
prefill-only KV compression methods.
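
The split between the two phases can be illustrated with a small sketch. This is not the authors' implementation: it assumes heavy hitters are ranked by an accumulated attention score per cache position, and the names `select_kept_positions`, `decode_budget`, and `recent_window` are hypothetical. All prefill positions are preserved, and only the decoding region is compressed to a fixed budget made of the most recent tokens plus the highest-scoring older ones.

```python
from typing import List


def select_kept_positions(
    scores: List[float],
    prefill_len: int,
    decode_budget: int,
    recent_window: int,
) -> List[int]:
    """Return the cache positions to keep, in ascending order.

    Assumption (not from the paper): scores[i] is an accumulated
    attention score for cache position i.
    """
    if not 0 <= prefill_len <= len(scores):
        raise ValueError("prefill_len must be between 0 and len(scores)")
    if not 0 <= recent_window <= decode_budget:
        raise ValueError("recent_window must be between 0 and decode_budget")

    # Prefill positions are preserved in full.
    prefill = list(range(prefill_len))
    decode = list(range(prefill_len, len(scores)))

    # Nothing to evict while the decoding region fits in its budget.
    if len(decode) <= decode_budget:
        return prefill + decode

    # The most recent decoding tokens are always kept.
    split = len(decode) - recent_window
    older, recent = decode[:split], decode[split:]

    # Fill the rest of the budget with the highest-scoring older tokens.
    n_heavy = decode_budget - recent_window
    heavy = sorted(older, key=lambda i: scores[i], reverse=True)[:n_heavy]

    return prefill + sorted(heavy) + recent


if __name__ == "__main__":
    # 3 prefill positions followed by 6 decoded positions.
    scores = [0.9, 0.1, 0.4, 0.2, 0.8, 0.05, 0.6, 0.3, 0.7]
    print(select_kept_positions(scores, prefill_len=3, decode_budget=4, recent_window=2))
    # prints [0, 1, 2, 4, 6, 7, 8]
```

Calling the function again after each generated token makes the recent window slide forward with the output, while the prefill region is never touched. A real system would apply this per layer and per attention head on tensors rather than on Python lists.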