MoH: Multi-Head Attention as Mixture-of-Head Attention
October 15, 2024
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
cs.AI
Abstract
In this work, we upgrade the multi-head attention mechanism, the core of the
Transformer model, to improve efficiency while maintaining or surpassing the
previous accuracy level. We show that multi-head attention can be expressed as a
summation over individual heads. Drawing on the insight that not all attention heads hold
equal significance, we propose Mixture-of-Head attention (MoH), a new
architecture that treats attention heads as experts in the Mixture-of-Experts
(MoE) mechanism. MoH has two significant advantages: First, MoH enables each
token to select the appropriate attention heads, enhancing inference efficiency
without compromising accuracy or increasing the number of parameters. Second,
MoH replaces the standard summation in multi-head attention with a weighted
summation, introducing flexibility to the attention mechanism and unlocking
extra performance potential. Extensive experiments on ViT, DiT, and LLMs
demonstrate that MoH outperforms multi-head attention while using only 50%-90% of
the attention heads. Moreover, we demonstrate that pre-trained multi-head
attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH
models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14
benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the
attention heads. We believe the proposed MoH is a promising alternative to
multi-head attention and provides a strong foundation for developing advanced
and efficient attention-based models.
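For concreteness, below is a minimal PyTorch-style sketch of the mixture-of-head idea described in the abstract: a per-token router scores the heads, only the top-k heads are activated, and their outputs are combined with per-token gates (most of them zero) instead of the implicit equal weights of standard multi-head attention. The module and parameter names (MoHAttention, router, top_k) are illustrative assumptions, not the authors' released implementation, and details such as shared heads or load-balancing losses are omitted.

```python
# Minimal sketch of Mixture-of-Head (MoH) attention in PyTorch.
# Hypothetical names; not the authors' released code. Shared heads and
# load-balancing terms from the paper are omitted for brevity.
import torch
import torch.nn as nn


class MoHAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, top_k: int = 6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.top_k = top_k                       # heads activated per token
        self.qkv = nn.Linear(dim, dim * 3)
        self.router = nn.Linear(dim, num_heads)  # per-token scores over heads
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, H, N, head_dim)

        # Standard scaled dot-product attention, computed per head.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        heads = attn.softmax(dim=-1) @ v         # (B, H, N, head_dim)
        heads = heads.permute(0, 2, 1, 3)        # (B, N, H, head_dim)

        # Router: each token keeps only its top-k heads, so the usual
        # plain summation over heads becomes a sparse, weighted summation.
        scores = self.router(x)                  # (B, N, H)
        top_val, top_idx = scores.topk(self.top_k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, top_idx,
                                                top_val.softmax(dim=-1))

        out = (heads * gate.unsqueeze(-1)).reshape(B, N, C)
        return self.proj(out)


# Example: activate 6 of 8 heads per token (75%, the ratio cited for MoH-LLaMA3-8B).
x = torch.randn(2, 16, 512)
print(MoHAttention(dim=512, num_heads=8, top_k=6)(x).shape)  # torch.Size([2, 16, 512])
```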