ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Abstract

To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that previous KV cache compression methods measure token importance individually, neglecting the dependencies between tokens that characterize real-world language. In light of this, we introduce ChunkKV, which groups the tokens in a chunk as a basic compression unit and retains the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluate ChunkKV on cutting-edge long-context benchmarks, including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmarks. Our experiments with instruction-tuned and multi-step reasoning (O1 and R1) LLMs achieve up to 10% performance improvement under aggressive compression ratios compared to existing methods.
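The two ideas in the abstract, chunk-level selection and layer-wise index reuse, can be illustrated with a minimal sketch. The abstract does not say how token importance is measured, how chunk scores are aggregated, or how often indices are recomputed, so the per-token scores, the sum aggregation, and the names `select_chunk_indices`, `indices_per_layer`, `keep_chunks`, and `reuse_depth` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def select_chunk_indices(
    token_scores: Sequence[float], chunk_size: int, keep_chunks: int
) -> List[int]:
    """Return the sorted token indices kept by chunk-level selection.

    token_scores are hypothetical per-token importance scores; the
    abstract does not specify how importance is computed.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n = len(token_scores)
    # Split positions into contiguous chunks; the last one may be shorter.
    chunks = [range(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]
    # Score each chunk by summing its token scores (an assumed aggregation).
    chunk_scores = [sum(token_scores[i] for i in c) for c in chunks]
    # Rank chunks from most to least informative and keep the top ones whole.
    order = sorted(range(len(chunks)), key=lambda j: chunk_scores[j], reverse=True)
    kept: List[int] = []
    for j in order[:keep_chunks]:
        kept.extend(chunks[j])
    return sorted(kept)


def indices_per_layer(
    layer_scores: Sequence[Sequence[float]],
    chunk_size: int,
    keep_chunks: int,
    reuse_depth: int,
) -> List[List[int]]:
    """Recompute kept indices every reuse_depth layers, reusing them in between."""
    if reuse_depth <= 0:
        raise ValueError("reuse_depth must be positive")
    result: List[List[int]] = []
    current: List[int] = []
    for layer, scores in enumerate(layer_scores):
        if layer % reuse_depth == 0:
            # Only these layers pay the cost of scoring and ranking chunks.
            current = select_chunk_indices(scores, chunk_size, keep_chunks)
        result.append(current)
    return result


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.5, 0.6]
    print(select_chunk_indices(scores, chunk_size=2, keep_chunks=2))
```

Because whole chunks are kept or dropped together, neighboring tokens stay contiguous in the compressed cache, unlike per-token selection, which can keep isolated positions. With `reuse_depth=1` every layer computes its own indices; larger values trade some per-layer specificity for fewer selection passes.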
