ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

Abstract

Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.
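To make the contrast between the two routing schemes concrete, the following is a minimal sketch in plain Python, not the Megatron-LM implementation linked above. The function names (topk_softmax_gates, relu_gates, moe_output) are hypothetical, the softmax-then-TopK ordering is one common choice assumed here for illustration, and the sparsity regulation and load balancing mentioned in the abstract are not modeled.

```python
import math

def topk_softmax_gates(logits, k):
    """Assumed baseline: softmax over all experts, then keep only the k largest gates."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    keep = set(sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k])
    return [probs[i] if i in keep else 0.0 for i in range(len(logits))]

def relu_gates(logits):
    """ReLU routing: an expert is active exactly when its router logit is positive."""
    return [max(0.0, x) for x in logits]

def moe_output(gates, expert_outputs):
    """Gate-weighted sum of expert outputs; experts with a zero gate are skipped."""
    dim = len(expert_outputs[0])
    out = [0.0] * dim
    for g, y in zip(gates, expert_outputs):
        if g == 0.0:
            continue  # inactive expert: no computation is spent on it
        for d in range(dim):
            out[d] += g * y[d]
    return out

# A tiny perturbation of the logits flips which expert TopK selects,
# so the TopK gate vector jumps, while the ReLU gates move only slightly.
before, after = [0.50, 0.49], [0.49, 0.50]
topk_before = topk_softmax_gates(before, k=1)
topk_after = topk_softmax_gates(after, k=1)
relu_before = relu_gates(before)
relu_after = relu_gates(after)

# With ReLU routing the number of active experts varies per token
# instead of being fixed at k.
token_gates = relu_gates([1.2, -0.3, 0.05, -2.0])
active_experts = sum(1 for g in token_gates if g > 0.0)
```

In this sketch the TopK gates for the two nearly identical inputs select different experts, which illustrates the discontinuity the abstract refers to, whereas the ReLU gates change by only 0.01 per expert. The variable number of positive gates per token is what allows computation to be allocated dynamically across tokens.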
