Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
January 21, 2025
Authors: Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak
cs.AI
Abstract
Scaling the capacity of language models has consistently proven to be a
reliable approach for improving performance and unlocking new capabilities.
Capacity can be primarily defined by two dimensions: the number of model
parameters and the compute per example. While scaling typically involves
increasing both, the precise interplay between these factors and their combined
contribution to overall capacity remains not fully understood. We explore this
relationship in the context of sparse Mixture-of-Experts (MoEs), which allow
scaling the number of parameters without proportionally increasing the FLOPs
per example. We investigate how varying the sparsity level, i.e., the fraction
of inactive parameters, impacts the model's performance during pretraining and
downstream few-shot evaluation. We find that under different constraints (e.g.,
parameter size and total training compute), there is an optimal level of
sparsity that improves both training efficiency and model performance. These
results provide a better understanding of the impact of sparsity in scaling
laws for MoEs and complement existing works in this area, offering insights for
designing more efficient architectures.
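The abstract's central quantity, the sparsity level, can be made concrete with a small sketch. The following is illustrative only and not taken from the paper: it assumes an MoE layer whose parameters all live in identical expert FFNs with top-k routing, so capacity grows with the number of experts while per-example FLOPs grow only with the number of active experts. The function name `moe_stats` and the example numbers are hypothetical.

```python
def moe_stats(num_experts: int, top_k: int, params_per_expert: int):
    """Illustrate how MoE sparsity decouples parameters from FLOPs.

    Simplifying assumptions: all parameters sit in identical expert FFNs,
    shared (non-expert) parameters are ignored, and FLOPs per example are
    proportional to the number of active parameters.
    """
    total_params = num_experts * params_per_expert   # capacity scales with E
    active_params = top_k * params_per_expert        # per-example compute scales with k
    sparsity = 1 - active_params / total_params      # fraction of inactive parameters
    return total_params, active_params, sparsity

# Example: 64 experts, 2 routed per token, 10M parameters per expert.
total, active, sparsity = moe_stats(num_experts=64, top_k=2,
                                    params_per_expert=10_000_000)
# sparsity = 1 - 2/64 = 0.96875: ~97% of parameters are inactive per token,
# so total capacity is 32x the per-example compute footprint.
```

Under this toy model, increasing `num_experts` while holding `top_k` fixed raises sparsity and total parameters without changing per-example FLOPs, which is exactly the axis the paper varies when searching for the optimal sparsity level.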