MH-MoE:Multi-Head Mixture-of-Experts
November 25, 2024
Authors: Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei
cs.AI
Abstract
Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by
using the multi-head mechanism to collectively attend to information from
various representation spaces within different experts. In this paper, we
present a novel implementation of MH-MoE that maintains both FLOPs and
parameter parity with sparse Mixture of Experts models. Experimental results on
language models show that the new implementation yields quality improvements
over both vanilla MoE and fine-grained MoE models. Additionally, our
experiments demonstrate that MH-MoE is compatible with 1-bit Large Language
Models (LLMs) such as BitNet.
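To make the mechanism described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a multi-head MoE layer: each token is split into head-sized sub-tokens, each sub-token is routed to its top-k experts, and the processed sub-tokens are merged back into a full token. All names and hyperparameters here (MultiHeadMoE, d_head, top_k, the expert FFN sizes) are illustrative assumptions, not the paper's implementation, and the sketch does not reproduce the FLOPs/parameter-parity accounting the paper describes.

```python
# Hypothetical multi-head MoE sketch; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadMoE(nn.Module):
    def __init__(self, d_model=512, num_heads=4, num_experts=8, top_k=2, d_ff=1024):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.top_k = top_k
        # Linear layers that mix information across heads before and after routing.
        self.head_in = nn.Linear(d_model, d_model)
        self.head_out = nn.Linear(d_model, d_model)
        # Router scores each sub-token (head slice) against every expert.
        self.router = nn.Linear(self.d_head, num_experts)
        # Each expert is a small feed-forward network operating on head-sized inputs.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(self.d_head, d_ff), nn.GELU(), nn.Linear(d_ff, self.d_head))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, d = x.shape
        # Split every token into num_heads sub-tokens of size d_head.
        sub = self.head_in(x).reshape(b * s * self.num_heads, self.d_head)
        # Top-k routing: each sub-token picks its k highest-scoring experts.
        probs = F.softmax(self.router(sub), dim=-1)
        topv, topi = probs.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topv[mask, k:k + 1] * expert(sub[mask])
        # Merge sub-tokens back into full tokens.
        return self.head_out(out.reshape(b, s, d))


# Usage: process a dummy batch of 2 sequences of 16 tokens.
layer = MultiHeadMoE()
y = layer(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```

The key design point the sketch illustrates is that routing happens per sub-token rather than per token, so different heads of the same token can reach different experts and thus attend to different representation spaces.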