MoH: Multi-Head Attention as Mixture-of-Head Attention
October 15, 2024
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
cs.AI
Abstract
In this work, we upgrade the multi-head attention mechanism, the core of the
Transformer model, to improve efficiency while maintaining or surpassing the
previous accuracy level. We show that multi-head attention can be expressed as a
summation over individual heads. Drawing on the insight that not all attention heads hold
equal significance, we propose Mixture-of-Head attention (MoH), a new
architecture that treats attention heads as experts in the Mixture-of-Experts
(MoE) mechanism. MoH has two significant advantages: First, MoH enables each
token to select the appropriate attention heads, enhancing inference efficiency
without compromising accuracy or increasing the number of parameters. Second,
MoH replaces the standard summation in multi-head attention with a weighted
summation, introducing flexibility to the attention mechanism and unlocking
extra performance potential. Extensive experiments on ViT, DiT, and LLMs
demonstrate that MoH outperforms multi-head attention while using only 50%-90% of
the attention heads. Moreover, we demonstrate that pre-trained multi-head
attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH
models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14
benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the
attention heads. We believe the proposed MoH is a promising alternative to
multi-head attention and provides a strong foundation for developing advanced
and efficient attention-based models.
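For concreteness, below is a minimal PyTorch-style sketch of the mixture-of-head idea described in the abstract: a per-token router scores the heads, only the top-k heads are activated, and their outputs are combined with per-token gates (most of them zero) instead of the implicit equal weights of standard multi-head attention. The module and parameter names (MoHAttention, router, top_k) are illustrative assumptions, not the authors' released implementation, and details such as shared heads or load-balancing losses are omitted.

```python
# Minimal sketch of Mixture-of-Head (MoH) attention in PyTorch.
# Hypothetical names; not the authors' released code. Shared heads and
# load-balancing terms from the paper are omitted for brevity.
import torch
import torch.nn as nn


class MoHAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, top_k: int = 6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.top_k = top_k                       # heads activated per token
        self.qkv = nn.Linear(dim, dim * 3)
        self.router = nn.Linear(dim, num_heads)  # per-token scores over heads
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, H, N, head_dim)

        # Standard scaled dot-product attention, computed per head.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        heads = attn.softmax(dim=-1) @ v         # (B, H, N, head_dim)
        heads = heads.permute(0, 2, 1, 3)        # (B, N, H, head_dim)

        # Router: each token keeps only its top-k heads, so the usual
        # plain summation over heads becomes a sparse, weighted summation.
        scores = self.router(x)                  # (B, N, H)
        top_val, top_idx = scores.topk(self.top_k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, top_idx,
                                                top_val.softmax(dim=-1))

        out = (heads * gate.unsqueeze(-1)).reshape(B, N, C)
        return self.proj(out)


# Example: activate 6 of 8 heads per token (75%, the ratio cited for MoH-LLaMA3-8B).
x = torch.randn(2, 16, 512)
print(MoHAttention(dim=512, num_heads=8, top_k=6)(x).shape)  # torch.Size([2, 16, 512])
```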