SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

Abstract

The Key-Value (KV) cache has become a bottleneck for LLMs in long-context generation. Despite numerous efforts in this area, optimization of the decoding phase is generally ignored. However, we believe such optimization is crucial, especially for long-output generation tasks, based on the following two observations: (i) excessive compression during the prefill phase, which requires specific full context, impairs comprehension of the reasoning task; (ii) deviation of heavy hitters occurs in reasoning tasks with long outputs. We therefore introduce SCOPE, a simple yet efficient framework that performs KV cache optimization separately during the prefill and decoding phases. Specifically, the KV cache from the prefill phase is preserved to maintain essential information, while a novel sliding-based strategy selects essential heavy hitters during the decoding phase. Memory usage and memory transfer are further optimized using adaptive and discontinuous strategies. Extensive experiments on LongGenBench show the effectiveness and generalization of SCOPE, as well as its compatibility as a plug-in to other prefill-only KV compression methods.
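The abstract does not spell out the selection rule, so the following is only a minimal sketch of the general idea of treating the two phases separately: every prefill position is kept, while decoding-phase positions are limited to a fixed budget made up of a recent sliding window plus the highest-scoring older positions (heavy hitters). The function name `select_kept_positions`, its parameters, and the use of accumulated attention mass as the score are all assumptions made for illustration; this is not the SCOPE implementation.

```python
def select_kept_positions(scores, prefill_len, decode_budget, recent_window):
    """Return the sorted cache positions to keep (illustrative only).

    scores        -- hypothetical accumulated attention mass per cached position
    prefill_len   -- number of leading positions that came from the prefill phase
    decode_budget -- maximum number of decoding-phase positions to keep
    recent_window -- number of most recent decoding positions that are always kept
    """
    total = len(scores)
    if not 0 <= prefill_len <= total:
        raise ValueError("prefill_len must lie within the cache length")
    if not 0 <= recent_window <= decode_budget:
        raise ValueError("recent_window must be between 0 and decode_budget")

    # Assumption: the prefill portion of the cache is preserved in full.
    prefill_idx = list(range(prefill_len))
    decode_idx = list(range(prefill_len, total))
    if len(decode_idx) <= decode_budget:
        return prefill_idx + decode_idx  # nothing to evict yet

    # Split decoding positions into an older part and a recent sliding window.
    split = len(decode_idx) - recent_window
    older_idx, recent_idx = decode_idx[:split], decode_idx[split:]

    # Among the older positions, keep those with the largest scores.
    n_heavy = decode_budget - recent_window
    heavy_idx = sorted(older_idx, key=lambda i: scores[i], reverse=True)[:n_heavy]
    return prefill_idx + sorted(heavy_idx + recent_idx)


# Toy example: 3 prefill positions followed by 6 decoding positions.
scores = [0.9, 0.1, 0.5, 0.2, 0.8, 0.05, 0.6, 0.3, 0.1]
kept = select_kept_positions(scores, prefill_len=3, decode_budget=4, recent_window=2)
print(kept)  # [0, 1, 2, 4, 6, 7, 8]
```

In this toy run the low-scoring prefill position 1 survives because the prefill cache is never compressed, whereas decoding positions 3 and 5 are evicted once the decoding budget is exceeded.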
