
MoH: Multi-Head Attention as Mixture-of-Head Attention

October 15, 2024
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
cs.AI

Abstract

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
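
To make the core idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: per-head outputs are combined as a sum, and a per-token router turns that sum into a weighted sum over only the top-k selected heads. This is an illustrative assumption-laden sketch, not the authors' implementation; the class name MoHAttention, the single linear router, and the plain top-k softmax gating are hypothetical choices, and the paper's actual routing (for example, any always-active shared heads or auxiliary balancing losses) may differ.

```python
import torch
import torch.nn as nn


class MoHAttention(nn.Module):
    """Sketch of Mixture-of-Head (MoH) attention.

    Standard multi-head attention can be written as a plain sum of
    per-head outputs:  MHA(x) = sum_i head_i(x) W_O^i.
    MoH replaces this with a routed, weighted sum in which each token
    activates only its top-k heads.
    """

    def __init__(self, dim: int, num_heads: int = 8, top_k: int = 6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.top_k = top_k
        self.qkv = nn.Linear(dim, dim * 3)
        self.out_proj = nn.Linear(dim, dim)
        # Router: produces one score per head for every token.
        self.router = nn.Linear(dim, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each: (B, H, N, d)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        heads = attn.softmax(dim=-1) @ v                     # (B, H, N, d)

        # Token-wise routing: keep only the top-k heads per token and
        # weight the surviving heads instead of summing them equally.
        scores = self.router(x)                              # (B, N, H)
        topk_val, topk_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter(
            -1, topk_idx, topk_val.softmax(dim=-1)
        )                                                    # zeros for unselected heads

        heads = heads * gates.permute(0, 2, 1).unsqueeze(-1)  # gate each head per token
        out = heads.transpose(1, 2).reshape(B, N, D)          # concat == weighted sum after W_O
        return self.out_proj(out)


# Usage example: 8 heads, 6 active per token (75% of heads, as in MoH-LLaMA3-8B).
x = torch.randn(2, 16, 256)
y = MoHAttention(dim=256, num_heads=8, top_k=6)(x)
```

Because concatenating the heads and applying the shared output projection is equivalent to summing the per-head projections, scaling each head by its gate before the projection realizes the weighted summation described above without adding extra projection matrices.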
