Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

Abstract

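This paper revisits how the Load-balancing Loss (LBL) is implemented when training Mixture-of-Experts (MoE) models. Specifically, the LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where N_E is the total number of experts, f_i is the frequency with which expert i is selected, and p_i is the average gating score of expert i. Existing MoE training frameworks usually employ a parallel training strategy, so f_i and the LBL are computed within a micro-batch and then averaged across parallel groups. A micro-batch for training billion-scale LLMs normally contains very few sequences. The micro-batch LBL therefore operates almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even the tokens of a domain-specific sequence (e.g., code) are routed uniformly to all experts, which inhibits expert specialization. In this work, we propose computing the LBL over a global batch to loosen this constraint. Because a global batch contains far more diverse sequences than a micro-batch, it encourages load balance at the corpus level. Specifically, we introduce an extra communication step that synchronizes f_i across micro-batches and then use the synchronized value to compute the LBL. Through experiments on training MoE-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent gains in both pre-training perplexity and downstream tasks. Our analysis shows that the global-batch LBL also greatly improves the domain specialization of MoE experts.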

English

This paper revisits the implementation of the Load-balancing Loss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, the LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where N_E is the total number of experts, f_i is the frequency with which expert i is selected, and p_i is the average gating score of expert i. Existing MoE training frameworks usually employ a parallel training strategy, so f_i and the LBL are calculated within a micro-batch and then averaged across parallel groups. A micro-batch for training billion-scale LLMs normally contains very few sequences. The micro-batch LBL is therefore almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating the LBL over a global batch to loosen this constraint. Because a global batch contains much more diverse sequences than a micro-batch, it encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize f_i across micro-batches and then use the synchronized f_i to calculate the LBL. Through experiments on training MoE-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
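
The difference between the two ways of computing the LBL can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, top-1 routing is assumed for simplicity, and the in-process average of f_i stands in for the extra communication step across parallel groups that the abstract describes.

```python
def expert_stats(assignments, gate_scores, num_experts):
    """Return (f, p) for one micro-batch.

    f[i]: fraction of tokens routed to expert i (top-1 routing assumed).
    p[i]: mean gate score of expert i over the micro-batch's tokens.
    """
    n_tokens = len(assignments)
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / n_tokens
    p = [sum(scores[i] for scores in gate_scores) / n_tokens
         for i in range(num_experts)]
    return f, p


def lbl(f, p):
    """Load-balancing loss N_E * sum_i f_i * p_i."""
    return len(f) * sum(fi * pi for fi, pi in zip(f, p))


def micro_batch_lbl(batches, num_experts):
    """Compute f and the LBL inside each micro-batch, then average the losses."""
    losses = [lbl(*expert_stats(a, g, num_experts)) for a, g in batches]
    return sum(losses) / len(losses)


def global_batch_lbl(batches, num_experts):
    """Average f over all micro-batches first (token-weighted), then compute
    each micro-batch's LBL with the shared f and average the losses."""
    stats = [expert_stats(a, g, num_experts) for a, g in batches]
    total_tokens = sum(len(a) for a, _ in batches)
    f_global = [
        sum(f[i] * len(a) for (f, _), (a, _) in zip(stats, batches)) / total_tokens
        for i in range(num_experts)
    ]
    losses = [lbl(f_global, p) for _, p in stats]
    return sum(losses) / len(losses)


# Toy data (made up): a "code" sequence routed entirely to expert 0 and a
# "prose" sequence routed entirely to expert 1, each as its own micro-batch.
code_batch = ([0, 0, 0, 0], [[0.9, 0.1]] * 4)
prose_batch = ([1, 1, 1, 1], [[0.1, 0.9]] * 4)
batches = [code_batch, prose_batch]

micro = micro_batch_lbl(batches, num_experts=2)
global_ = global_batch_lbl(batches, num_experts=2)
print(f"micro-batch LBL:  {micro:.3f}")
print(f"global-batch LBL: {global_:.3f}")
```

On this toy input the micro-batch loss comes out at about 1.8, while the global-batch loss is about 1.0, the value a perfectly balanced router would reach with two experts. The per-sequence specialization is penalized only when balance is enforced inside each micro-batch, which is the effect the abstract attributes to the micro-batch LBL.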

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
