Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

Abstract

This paper investigates an under-explored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. While existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive empirical study evaluating prominent KV cache compression methods across diverse tasks, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals that KV cache compression methods exhibit task-specific performance degradation. Arithmetic reasoning tasks prove particularly sensitive to aggressive compression, with different methods showing performance drops of 17.4%-43.3%. Notably, the DeepSeek R1 Distill model exhibits more robust compression tolerance than instruction-tuned models, showing only 9.67%-25.53% performance degradation. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that handles the prefill and decoding phases separately while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9%-18% performance improvements on long-context generation tasks under aggressive compression ratios.
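To make "shot-level semantic coherence" concrete, the following is a minimal sketch of keeping or dropping few-shot examples as whole units instead of evicting individual tokens. It is not the ShotKV algorithm from the paper: the function name `select_shots`, the mean-attention scoring rule, and the greedy token budget are all assumptions chosen purely for illustration.

```python
from typing import List, Sequence


def select_shots(
    shot_lengths: Sequence[int],
    token_scores: Sequence[float],
    budget: int,
) -> List[int]:
    """Pick whole shots to keep in the KV cache under a token budget.

    Hypothetical illustration only: each shot is scored by the mean of its
    per-token scores (e.g. attention mass), and shots are kept greedily in
    descending score order as long as they fit in the budget.
    """
    if sum(shot_lengths) != len(token_scores):
        raise ValueError("shot_lengths must sum to len(token_scores)")

    # Compute one score per shot by averaging the scores of its tokens.
    shot_means = []
    start = 0
    for length in shot_lengths:
        if length <= 0:
            raise ValueError("each shot must contain at least one token")
        chunk = token_scores[start:start + length]
        shot_means.append(sum(chunk) / length)
        start += length

    # Visit shots from highest to lowest mean score.
    order = sorted(range(len(shot_lengths)),
                   key=lambda i: shot_means[i], reverse=True)

    kept: List[int] = []
    used = 0
    for i in order:
        # A shot is either kept entirely or dropped entirely.
        if used + shot_lengths[i] <= budget:
            kept.append(i)
            used += shot_lengths[i]

    # Return indices in prompt order so the kept shots stay contiguous units.
    return sorted(kept)


# Three shots of 3, 2, and 4 tokens; budget of 6 cached tokens.
scores = [0.1, 0.1, 0.1, 0.9, 0.8, 0.5, 0.5, 0.5, 0.5]
print(select_shots([3, 2, 4], scores, budget=6))  # [1, 2]
```

In this toy setup the first shot is dropped in full, so no kept example is left with a partially evicted context. A token-level policy with the same budget could instead keep fragments of all three shots.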
