SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

Abstract

The key-value (KV) cache has become a bottleneck for LLMs in long-context generation. Despite numerous efforts in this area, optimization of the decoding phase is generally ignored. However, we believe such optimization is crucial, especially for long-output generation tasks, based on the following two observations: (i) excessive compression during the prefill phase, which requires specific full context, impairs the comprehension of the reasoning task; (ii) deviation of heavy hitters occurs in reasoning tasks with long outputs. We therefore introduce SCOPE, a simple yet efficient framework that performs KV cache optimization separately during the prefill and decoding phases. Specifically, the KV cache from the prefill phase is preserved to maintain essential information, while a novel sliding-based strategy selects the essential heavy hitters for the decoding phase. Memory usage and memory transfer are further optimized using adaptive and discontinuous strategies. Extensive experiments on LongGenBench show the effectiveness and generalization of SCOPE, as well as its compatibility as a plug-in to other prefill-only KV compression methods.
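The split between an untouched prefill cache and a compressed decoding cache can be illustrated with a small index-tracking sketch. This is not the authors' implementation: the abstract does not specify how the sliding selection works, so the class name `DecodeCache`, the knobs `decode_budget` and `recent_window`, and the rule of ranking older decoding positions by accumulated attention are all assumptions made for illustration. The sketch tracks only which positions would be kept, not the key and value tensors themselves.

```python
from dataclasses import dataclass, field


@dataclass
class DecodeCache:
    """Toy tracker of kept KV positions (hypothetical, for illustration only).

    Prefill positions are always kept; decoding positions are capped at
    `decode_budget` entries.
    """
    prefill_len: int
    decode_budget: int   # assumed knob: max decoding-phase entries kept
    recent_window: int   # assumed knob: newest decoding entries always kept
    scores: dict = field(default_factory=dict)  # position -> accumulated attention
    kept: list = field(default_factory=list)    # kept decoding positions, ascending

    def __post_init__(self):
        if not 0 < self.recent_window <= self.decode_budget:
            raise ValueError("need 0 < recent_window <= decode_budget")

    def step(self, position, attention):
        """Register a newly decoded token, then evict if over budget.

        `attention` maps kept decoding positions to the attention weight
        the new token assigned to them.
        """
        for pos, weight in attention.items():
            if pos in self.scores:
                self.scores[pos] += weight
        self.scores[position] = 0.0
        self.kept.append(position)
        if len(self.kept) > self.decode_budget:
            self._evict()

    def _evict(self):
        # The recent window slides forward with every decoded token.
        recent = self.kept[-self.recent_window:]
        older = self.kept[:-self.recent_window]
        # Remaining budget goes to the highest-scoring older positions.
        n_heavy = self.decode_budget - len(recent)
        heavy = sorted(older, key=lambda p: self.scores[p], reverse=True)[:n_heavy]
        survivors = sorted(heavy) + recent
        for pos in set(self.kept) - set(survivors):
            del self.scores[pos]
        self.kept = survivors

    def positions(self):
        """All cache positions currently kept: full prefill plus survivors."""
        return list(range(self.prefill_len)) + self.kept


cache = DecodeCache(prefill_len=4, decode_budget=3, recent_window=2)
cache.step(4, {})
cache.step(5, {4: 0.9})
cache.step(6, {4: 0.5, 5: 0.1})
cache.step(7, {4: 0.3, 5: 0.1, 6: 0.2})
print(cache.positions())  # [0, 1, 2, 3, 4, 6, 7]
```

In this toy run the four prefill positions are never candidates for eviction, while decoding position 5 is dropped because it lies outside the recent window and has the lowest accumulated attention among the older decoding positions.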
