MoH: Multi-Head Attention as Mixture-of-Head Attention

Abstract

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages. First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
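The two ideas in the abstract, the summation form of multi-head attention and its weighted-sum variant, can be illustrated with a small sketch. The first part checks the standard identity that concatenating head outputs and applying the output projection equals summing each head's output multiplied by its own block of rows of that projection. The second part is only a rough illustration of a weighted sum over a per-token subset of heads. The abstract does not specify the router, so the function name `weighted_head_sum`, the random router logits, and the softmax over the top-k logits are assumptions made here for illustration, and random matrices stand in for real attention head outputs.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random matrix used as a stand-in for learned weights or head outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def weighted_head_sum(per_head, router_logits, top_k):
    """Hypothetical routing: each token keeps its top_k heads and mixes their
    projected outputs with softmax gates computed over the kept logits."""
    width = len(per_head[0][0])
    out = []
    for t, logits in enumerate(router_logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[:top_k]
        gates = softmax([logits[i] for i in chosen])
        out.append([
            sum(g * per_head[i][t][j] for g, i in zip(gates, chosen))
            for j in range(width)
        ])
    return out


rng = random.Random(0)
tokens, heads, d_head, d_model = 3, 4, 2, 8

# Stand-ins for the per-head attention outputs H_i (tokens x d_head).
head_outputs = [random_matrix(tokens, d_head, rng) for _ in range(heads)]
w_o = random_matrix(heads * d_head, d_model, rng)  # output projection W_O

# Concatenation form: Concat(H_1, ..., H_h) @ W_O
concat = [[x for h in head_outputs for x in h[t]] for t in range(tokens)]
mha_concat = matmul(concat, w_o)

# Summation form: sum_i H_i @ W_O^i, where W_O^i is the i-th block of rows of W_O
w_o_blocks = [w_o[i * d_head:(i + 1) * d_head] for i in range(heads)]
per_head = [matmul(h, w) for h, w in zip(head_outputs, w_o_blocks)]
mha_sum = [
    [sum(p[t][j] for p in per_head) for j in range(d_model)]
    for t in range(tokens)
]

# Weighted sum over 3 of 4 heads per token (75% of heads), with assumed router logits.
router_logits = random_matrix(tokens, heads, rng)
moh_out = weighted_head_sum(per_head, router_logits, top_k=3)
```

Because each head's contribution is already a separate term in the summation form, dropping or reweighting a head amounts to changing one coefficient in the sum, which is what makes the expert-style routing described above a natural fit.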

