DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Abstract

Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, also known as Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, referred to as Streaming Heads, primarily focus on recent tokens and attention sinks and do not require full attention. Based on this insight, we introduce DuoAttention, a framework that applies a full KV cache only to retrieval heads while using a lightweight, constant-length KV cache for streaming heads. This reduces both decoding and pre-filling memory and latency without compromising the model's long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method reduces long-context inference memory by up to 2.55x for MHA models and 1.67x for GQA models, speeds up decoding by up to 2.18x and 1.50x, and accelerates pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with a context length of 3.3 million tokens on a single A100 GPU. Code is available at https://github.com/mit-han-lab/duo-attention.
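
The split between full and constant-length caches can be illustrated with a toy sketch. This is not the authors' implementation: the class `HeadCache`, the helpers `run` and `memory_reduction`, and the sink and recent-window sizes are all hypothetical, and integer token positions stand in for the actual Key and Value tensors. It assumes the set of retrieval heads is already known, whereas DuoAttention identifies them with an optimization-based procedure.

```python
from dataclasses import dataclass, field


@dataclass
class HeadCache:
    """Toy per-head KV cache; token positions stand in for K/V tensors."""
    is_retrieval: bool
    sink: int = 4      # assumed number of attention-sink tokens kept
    recent: int = 8    # assumed size of the recent-token window
    tokens: list = field(default_factory=list)

    def append(self, position: int) -> None:
        self.tokens.append(position)
        if self.is_retrieval:
            return  # retrieval heads keep every token (full KV cache)
        if len(self.tokens) > self.sink + self.recent:
            # Streaming heads keep only the first `sink` entries and the
            # last `recent` entries, so the cache length stays constant.
            tail_start = len(self.tokens) - self.recent
            self.tokens = self.tokens[:self.sink] + self.tokens[tail_start:]


def run(num_heads: int, num_retrieval: int, seq_len: int) -> list:
    """Feed `seq_len` tokens to every head; the first `num_retrieval`
    heads are treated as retrieval heads (an assumption for this demo)."""
    heads = [HeadCache(is_retrieval=(i < num_retrieval)) for i in range(num_heads)]
    for pos in range(seq_len):
        for head in heads:
            head.append(pos)
    return heads


def memory_reduction(heads: list, seq_len: int) -> float:
    """Ratio of full-cache entries to entries actually stored."""
    full = len(heads) * seq_len
    used = sum(len(head.tokens) for head in heads)
    return full / used


if __name__ == "__main__":
    heads = run(num_heads=8, num_retrieval=2, seq_len=100)
    print([len(h.tokens) for h in heads])
    print(round(memory_reduction(heads, 100), 2))
```

In this toy setting the saving grows with sequence length, because the streaming heads' caches stop growing once they reach `sink + recent` entries while the retrieval heads' caches keep growing linearly. The ratios it prints depend entirely on the assumed head split and window sizes and are not comparable to the figures reported in the abstract.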
