TransMLA: Multi-head Latent Attention Is All You Need

Abstract

Modern large language models (LLMs) are often bottlenecked on current hardware by communication rather than by raw computation. Multi-head Latent Attention (MLA) addresses this by using low-rank matrices in the key-value (KV) layers, so that compressed latent KV states can be cached. This significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. MLA also employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in DeepSeek V2/V3/R1, many major model providers still rely on Grouped-Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA with the same KV cache overhead, but that the converse does not hold. To encourage broader use of MLA, we introduce **TransMLA**, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in converted models, thus enabling more efficient distillation of DeepSeek R1.
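The claim that GQA can be written as MLA with the same KV cache can be illustrated with a small NumPy sketch. This is not the paper's conversion procedure. It only shows the underlying identity that repeating each cached KV head across its group of query heads is the same as multiplying the cache by a fixed 0/1 up-projection matrix. All sizes and variable names (`w_k`, `up`, `k_cache`) are hypothetical toy choices, and only keys are shown on the assumption that values follow the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): 8 query heads share 2 KV heads of width 4.
seq_len, d_model = 5, 16
n_q_heads, n_kv_heads, d_head = 8, 2, 4
group = n_q_heads // n_kv_heads  # query heads served by each KV head

x = rng.standard_normal((seq_len, d_model))                # token hidden states
w_k = rng.standard_normal((d_model, n_kv_heads * d_head))  # GQA key projection

# GQA: cache the small key tensor, then repeat each KV head for its group.
k_cache = x @ w_k  # shape (seq_len, n_kv_heads * d_head); this is what gets cached
k_gqa = np.repeat(k_cache.reshape(seq_len, n_kv_heads, d_head), group, axis=1)
k_gqa = k_gqa.reshape(seq_len, n_q_heads * d_head)

# MLA view: keep the same cache as the latent state and express the
# repetition as a fixed 0/1 up-projection matrix.
head_map = np.repeat(np.eye(n_kv_heads), group, axis=1)  # (n_kv_heads, n_q_heads)
up = np.kron(head_map, np.eye(d_head))  # (n_kv_heads * d_head, n_q_heads * d_head)
k_mla = k_cache @ up

print(np.allclose(k_gqa, k_mla))   # the two key tensors agree
print(k_cache.shape, k_mla.shape)  # cached latent vs. expanded keys
```

In this sketch the cached tensor is identical in both views, so the cache size does not change. The up-projection here is a fixed replication pattern, whereas a trained up-projection of the same shape would not be restricted to that pattern, which is one way to read the abstract's statement that the converse does not hold.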
