MH-MoE:Multi-Head Mixture-of-Experts
November 25, 2024
Authors: Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei
cs.AI
Abstract
Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by
using the multi-head mechanism to collectively attend to information from
various representation spaces within different experts. In this paper, we
present a novel implementation of MH-MoE that maintains both FLOPs and
parameter parity with sparse Mixture of Experts models. Experimental results on
language models show that the new implementation yields quality improvements
over both vanilla MoE and fine-grained MoE models. Additionally, our
experiments demonstrate that MH-MoE is compatible with 1-bit Large Language
Models (LLMs) such as BitNet.
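To make the mechanism described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a multi-head MoE layer: each token is split into head-sized sub-tokens, each sub-token is routed to its top-k experts, and the processed sub-tokens are merged back into a full token. All names and hyperparameters here (MultiHeadMoE, d_head, top_k, the expert FFN sizes) are illustrative assumptions, not the paper's implementation, and the sketch does not reproduce the FLOPs/parameter-parity accounting the paper describes.

```python
# Hypothetical multi-head MoE sketch; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadMoE(nn.Module):
    def __init__(self, d_model=512, num_heads=4, num_experts=8, top_k=2, d_ff=1024):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.top_k = top_k
        # Linear layers that mix information across heads before and after routing.
        self.head_in = nn.Linear(d_model, d_model)
        self.head_out = nn.Linear(d_model, d_model)
        # Router scores each sub-token (head slice) against every expert.
        self.router = nn.Linear(self.d_head, num_experts)
        # Each expert is a small feed-forward network operating on head-sized inputs.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(self.d_head, d_ff), nn.GELU(), nn.Linear(d_ff, self.d_head))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, d = x.shape
        # Split every token into num_heads sub-tokens of size d_head.
        sub = self.head_in(x).reshape(b * s * self.num_heads, self.d_head)
        # Top-k routing: each sub-token picks its k highest-scoring experts.
        probs = F.softmax(self.router(sub), dim=-1)
        topv, topi = probs.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topv[mask, k:k + 1] * expert(sub[mask])
        # Merge sub-tokens back into full tokens.
        return self.head_out(out.reshape(b, s, d))


# Usage: process a dummy batch of 2 sequences of 16 tokens.
layer = MultiHeadMoE()
y = layer(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```

The key design point the sketch illustrates is that routing happens per sub-token rather than per token, so different heads of the same token can reach different experts and thus attend to different representation spaces.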