Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
February 26, 2025
Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
cs.AI
Abstract
The Mixture of Experts (MoE) architecture reduces the training and inference
cost significantly compared to a dense model of equivalent capacity. Upcycling
is an approach that initializes and trains an MoE model using a pre-trained
dense model. While upcycling leads to initial performance gains, the training
progresses slower than when trained from scratch, leading to suboptimal
performance in the long term. We propose Drop-Upcycling - a method that
effectively addresses this problem. Drop-Upcycling combines two seemingly
contradictory approaches: utilizing the knowledge of pre-trained dense models
while statistically re-initializing some parts of the weights. This approach
strategically promotes expert specialization, significantly enhancing the MoE
model's efficiency in knowledge acquisition. Extensive large-scale experiments
demonstrate that Drop-Upcycling significantly outperforms previous MoE
construction methods in the long term, specifically when training on hundreds
of billions of tokens or more. As a result, our MoE model with 5.9B active
parameters achieves comparable performance to a 13B dense model in the same
model family, while requiring approximately 1/4 of the training FLOPs. All
experimental resources, including source code, training data, model checkpoints
and logs, are publicly available to promote reproducibility and future research
on MoE.
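To make the "partial re-initialization" idea concrete, below is a minimal PyTorch sketch of how a dense FFN might be turned into several expert FFNs: each expert starts as a copy of the dense weights, then a randomly chosen fraction of the intermediate dimension is re-sampled from a normal distribution matching the statistics of the original weights, so that experts can diverge and specialize. This is not the authors' released code; the helper name `drop_upcycle_ffn`, the SwiGLU-style weight layout, and the exact statistics used for re-sampling are assumptions, and details such as router initialization are omitted.

```python
import torch


def drop_upcycle_ffn(w_gate, w_up, w_down, num_experts, drop_ratio, generator=None):
    """Sketch of Drop-Upcycling for one FFN block (assumed SwiGLU layout).

    Shapes (assumed): w_gate, w_up: (d_ff, d_model); w_down: (d_model, d_ff).
    Returns a list of (gate, up, down) weight tuples, one per expert.
    """
    d_ff = w_gate.shape[0]
    n_drop = int(drop_ratio * d_ff)
    experts = []
    for _ in range(num_experts):
        # Each expert starts from a copy of the pre-trained dense FFN (upcycling).
        g, u, d = w_gate.clone(), w_up.clone(), w_down.clone()
        # Independently pick which intermediate units to re-initialize for this expert.
        idx = torch.randperm(d_ff, generator=generator)[:n_drop]
        for w, dim in ((g, 0), (u, 0), (d, 1)):
            sel = w.index_select(dim, idx)
            # Statistics-based re-initialization: re-sample the selected slices from a
            # normal distribution with the mean/std of the original weights (assumption:
            # statistics are taken from the selected slice of this matrix).
            new = torch.randn_like(sel) * sel.std() + sel.mean()
            w.index_copy_(dim, idx, new)
        experts.append((g, u, d))
    return experts


# Example usage with toy dimensions (d_model=8, d_ff=32):
if __name__ == "__main__":
    w_gate = torch.randn(32, 8)
    w_up = torch.randn(32, 8)
    w_down = torch.randn(8, 32)
    experts = drop_upcycle_ffn(w_gate, w_up, w_down, num_experts=4, drop_ratio=0.5)
    print(len(experts), experts[0][0].shape)
```

The intent of the re-sampling step, per the abstract, is to keep most of the dense model's knowledge while injecting enough randomness that the router and experts do not remain near-identical, which the paper argues is what slows down naive upcycling in long training runs.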