ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

December 19, 2024
Authors: Ziteng Wang, Jianfei Chen, Jun Zhu
cs.AI

Abstract

Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.
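To make the routing contrast concrete, below is a minimal PyTorch sketch of a conventional TopK+Softmax router next to a ReLU router in the spirit of ReMoE. The class names, the bias-free linear gate, and the omission of the paper's sparsity-regularization and load-balancing terms are illustrative assumptions based only on the abstract; the authors' actual implementation is the Megatron-LM code linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKSoftmaxRouter(nn.Module):
    """Conventional TopK+Softmax router: only the top-k gate logits per token
    are kept, so the set of active experts changes discontinuously as the
    router weights move."""

    def __init__(self, hidden_size: int, num_experts: int, k: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                                  # [tokens, num_experts]
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)      # hard, non-differentiable selection
        weights = torch.zeros_like(logits)
        weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
        return weights                                         # gradients flow only through the kept logits


class ReLURouter(nn.Module):
    """ReLU router as described in the abstract: the gate output passes through
    ReLU, so inactive experts receive exactly zero weight while the mapping
    stays continuous and differentiable. Sparsity is regulated by a separate
    regularization term (not shown in this sketch)."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.gate(x))                            # zeros mark unused experts per token
```

Because the ReLU gate is not forced to activate a fixed k experts per token, the number of active experts can vary across tokens and layers, which is the dynamic compute allocation the abstract refers to.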
