ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

Abstract

Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.
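To make the contrast in the abstract concrete, the following is a minimal sketch of the two gating rules for a single token, not the authors' Megatron-LM implementation. The function names, the scalar expert outputs, and the example logits are all hypothetical, and the softmax-then-TopK ordering is an assumption about the baseline. The sparsity regulation and load balancing mentioned in the abstract are not modeled here.

```python
import math


def topk_softmax_gates(logits, k):
    """Baseline sketch: softmax over experts, keep the k largest gates, zero the rest."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    keep = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    return [probs[i] if i in keep else 0.0 for i in range(len(logits))]


def relu_gates(logits):
    """ReLU routing sketch: an expert is active exactly when its logit is positive."""
    return [max(0.0, x) for x in logits]


def moe_output(gates, expert_outputs):
    """Gate-weighted sum of expert outputs (scalars for illustration); inactive experts are skipped."""
    return sum(g * y for g, y in zip(gates, expert_outputs) if g > 0.0)


# Hypothetical router logits for two tokens over four experts.
logits_a = [1.2, -0.3, 0.4, -2.0]
logits_b = [0.1, -0.5, -0.2, -1.0]

topk_a = topk_softmax_gates(logits_a, k=2)  # always exactly k active experts
relu_a = relu_gates(logits_a)               # two active experts for this token
relu_b = relu_gates(logits_b)               # one active expert for this token
```

In this sketch the TopK gate of an expert jumps to zero the moment it drops out of the top k, whereas the ReLU gate shrinks continuously to zero as its logit approaches zero. The number of active experts under ReLU also varies per token, which is the kind of dynamic allocation the abstract describes.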
