ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

December 19, 2024
Authors: Ziteng Wang, Jianfei Chen, Jun Zhu
cs.AI

Abstract

Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.
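
To make the routing difference concrete, below is a minimal sketch contrasting a conventional TopK+Softmax router with a ReLU router in the spirit described in the abstract. This is not the authors' Megatron-LM implementation; the class names, the per-token gate shapes, and the L1-style sparsity penalty are illustrative assumptions only.

```python
# Minimal sketch (assumed, not the authors' code): TopK+Softmax routing vs.
# ReLU routing. ReMoE's actual sparsity regulation and load balancing differ;
# see https://github.com/thu-ml/ReMoE for the reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Conventional routing: keep only the top-k softmax gates (discontinuous)."""

    def __init__(self, hidden_size: int, num_experts: int, k: int = 2):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, hidden]
        gates = F.softmax(self.proj(x), dim=-1)
        topk_vals, topk_idx = gates.topk(self.k, dim=-1)
        # Exactly k experts are active per token; the selection step is
        # non-differentiable with respect to which experts are chosen.
        return torch.zeros_like(gates).scatter(-1, topk_idx, topk_vals)


class ReLURouter(nn.Module):
    """Fully differentiable routing: ReLU zeroes out gates continuously,
    so the number of active experts can vary across tokens and layers."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_experts)

    def forward(self, x: torch.Tensor):
        gates = F.relu(self.proj(x))  # nonnegative and naturally sparse
        # Sparsity must be regulated with an auxiliary objective; an L1-style
        # penalty on the gates is used here purely as an assumption.
        aux_loss = gates.mean()
        return gates, aux_loss
```

As a usage note, both routers map a token representation to per-expert gate values; the key distinction the abstract highlights is that the ReLU gates are a continuous function of the input, so gradients flow through the routing decision itself rather than only through the selected experts.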
