TransMLA: Multi-head Latent Attention Is All You Need
February 11, 2025
Authors: Fanxu Meng, Zengwei Yao, Muhan Zhang
cs.AI
Abstract
Modern large language models (LLMs) often encounter communication bottlenecks
on current hardware, rather than purely computational constraints. Multi-head
Latent Attention (MLA) tackles this challenge by using low-rank matrices in the
key-value (KV) layers, thereby allowing compressed latent KV states to be
cached. This approach significantly reduces the KV cache size relative to
traditional multi-head attention, leading to faster inference. Moreover, MLA
employs an up-projection matrix to increase expressiveness, trading additional
computation for reduced communication overhead. Although MLA has demonstrated
efficiency and effectiveness in DeepSeek V2/V3/R1, many major model providers
still rely on Grouped-Query Attention (GQA) and have not announced any plans to
adopt MLA. In this paper, we show that GQA can always be represented by MLA
while maintaining the same KV cache overhead, but the converse does not hold.
To encourage broader use of MLA, we introduce **TransMLA**, a post-training
method that converts widely used GQA-based pre-trained models (e.g., LLaMA,
Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo
additional training to boost expressiveness without increasing the KV cache
size. Furthermore, we plan to develop MLA-specific inference acceleration
techniques to preserve low latency in transformed models, thus enabling more
efficient distillation of DeepSeek R1.
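
The claim that GQA can be expressed as MLA with the same KV cache size can be illustrated with a small sketch. The dimensions, the toy integer weights, and the helper names (`matmul`, `repeat_kv_heads`, `replication_matrix`) are assumptions made for illustration, and the 0/1 replication matrix is only one simple factorization, not necessarily the construction used in the paper. The idea shown: the keys GQA obtains by copying each cached KV head across its query group equal a cached latent of the same width multiplied by a fixed up-projection.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def repeat_kv_heads(k, n_kv_heads, head_dim, group_size):
    """GQA-style expansion: copy each cached KV head `group_size` times."""
    out = []
    for row in k:
        new_row = []
        for h in range(n_kv_heads):
            head = row[h * head_dim:(h + 1) * head_dim]
            new_row.extend(head * group_size)
        out.append(new_row)
    return out


def replication_matrix(n_kv_heads, head_dim, group_size):
    """0/1 up-projection that reproduces repeat_kv_heads as a matrix product."""
    rows = n_kv_heads * head_dim
    cols = rows * group_size
    w = [[0] * cols for _ in range(rows)]
    for h in range(n_kv_heads):
        for g in range(group_size):
            for i in range(head_dim):
                w[h * head_dim + i][(h * group_size + g) * head_dim + i] = 1
    return w


# Toy configuration (assumed): 4 query heads share 2 KV heads of size 2.
n_q_heads, n_kv_heads, head_dim = 4, 2, 2
group_size = n_q_heads // n_kv_heads

# 3 tokens with hidden size 4, and a key projection of shape 4 x (n_kv_heads * head_dim).
x = [[1, 2, 0, -1], [0, 1, 3, 2], [2, -1, 1, 0]]
w_k = [[1, 0, 2, -1], [0, 1, 1, 3], [2, -1, 0, 1], [1, 1, -2, 0]]

# GQA view: cache the small keys, then replicate heads at attention time.
k_cache = matmul(x, w_k)
k_gqa = repeat_kv_heads(k_cache, n_kv_heads, head_dim, group_size)

# MLA view: cache a latent of the same width, then apply an up-projection.
latent = matmul(x, w_k)
w_up = replication_matrix(n_kv_heads, head_dim, group_size)
k_mla = matmul(latent, w_up)

assert k_gqa == k_mla                      # identical per-query-head keys
assert len(latent[0]) == len(k_cache[0])   # identical per-token cache width
print("cache width:", len(latent[0]), "expanded key width:", len(k_mla[0]))
```

In this sketch the up-projection is frozen to a replication pattern. If `w_up` were instead a trainable matrix, it could move away from that pattern while the cached latent keeps the same width, which is one way to read the abstract's statement that the converse does not hold.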