GRIN: GRadient-INformed MoE
September 18, 2024
作者: Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
cs.AI
Abstract
Mixture-of-Experts (MoE) models scale more effectively than dense models due
to sparse computation through expert routing, selectively activating only a
small subset of expert modules. However, sparse computation challenges
traditional training practices, as discrete expert routing hinders standard
backpropagation and thus gradient-based optimization, which are the cornerstone
of deep learning. To better pursue the scaling power of MoE, we introduce GRIN
(GRadient-INformed MoE training), which incorporates sparse gradient estimation
for expert routing and configures model parallelism to avoid token dropping.
Applying GRIN to autoregressive language modeling, we develop a top-2
16×3.8B MoE model. Our model, with only 6.6B activated parameters,
outperforms a 7B dense model and matches the performance of a 14B dense model
trained on the same data. Extensive evaluations across diverse tasks
demonstrate the potential of GRIN to significantly enhance MoE efficacy,
achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
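To make the routing bottleneck concrete, below is a minimal sketch (not taken from the paper, and not GRIN's actual estimator) of a conventional top-2 MoE feed-forward layer in PyTorch. The `topk` selection is discrete, so gradients reach the router only through the gate weights of the experts that were actually chosen, never through the choice itself; this is the gap that GRIN's sparse gradient estimation for expert routing targets. The class name `Top2MoE` and all dimensions are illustrative assumptions.

```python
# Illustrative sketch only (assumed names and shapes; not the GRIN implementation).
import torch
import torch.nn as nn


class Top2MoE(nn.Module):
    """A plain top-2 mixture-of-experts feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)           # (num_tokens, n_experts)
        top_p, top_idx = probs.topk(2, dim=-1)           # discrete top-2 routing decision
        gates = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalized gate weights

        out = torch.zeros_like(x)
        for k in range(2):                               # each token visits 2 experts
            for e, expert in enumerate(self.experts):
                token_ids = (top_idx[:, k] == e).nonzero(as_tuple=True)[0]
                if token_ids.numel() == 0:
                    continue
                # Gradients flow to the router through `gates`, but the topk
                # selection above is non-differentiable: standard backprop never
                # "sees" the alternative routing choices.
                out.index_add_(0, token_ids,
                               gates[token_ids, k:k + 1] * expert(x[token_ids]))
        return out


# Usage: with 16 experts and top-2 routing, each token activates only 2 of the
# expert FFNs, which is the sparse-computation pattern the abstract describes
# (dimensions here are arbitrary toy values).
layer = Top2MoE(d_model=256, d_ff=1024, n_experts=16)
y = layer(torch.randn(8, 256))
```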