Gemstones: A Model Suite for Multi-Faceted Scaling Laws
February 7, 2025
Authors: Sean McLeish, John Kirchenbauer, David Yu Miller, Siddharth Singh, Abhinav Bhatele, Micah Goldblum, Ashwinee Panda, Tom Goldstein
cs.AI
Abstract
Scaling laws are typically fit using a family of models with a narrow range
of frozen hyper-parameter choices. In this work we study scaling laws using a
wide range of architecture and hyper-parameter choices, and highlight their
impact on resulting prescriptions. As a primary artifact of our research, we
release the Gemstones: the most comprehensive open-source scaling law dataset
to date, consisting of over 4000 checkpoints from transformers with up to 2
billion parameters; these models have been trained with different learning
rates, cooldown schedules, and architectural shapes. Our checkpoints enable
more complex studies of scaling, such as a law that predicts language modeling
performance as a function of model width and depth. By examining the various
facets of our model suite, we find that the prescriptions of scaling laws can
be highly sensitive to the experimental design process and the specific model
checkpoints used during fitting. Code:
https://github.com/mcleish7/gemstone-scaling-laws
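To make the idea of "a law that predicts language modeling performance as a function of model width and depth" concrete, here is a minimal sketch of fitting such a law to synthetic data. The functional form `L(w, d) = A * w^(-alpha) * d^(-beta)` and all numbers below are illustrative assumptions, not the parameterization or data used in the paper; the fit is a plain least-squares regression in log space.

```python
import numpy as np

# Hypothetical scaling law: L(w, d) = A * w^(-alpha) * d^(-beta).
# Synthetic losses over a grid of widths and depths (illustrative values).
rng = np.random.default_rng(0)
widths = np.array([256, 512, 1024, 2048, 4096], dtype=float)
depths = np.array([4, 8, 16, 32], dtype=float)
W, D = np.meshgrid(widths, depths)
w, d = W.ravel(), D.ravel()

true_A, true_alpha, true_beta = 20.0, 0.30, 0.15
loss = true_A * w**-true_alpha * d**-true_beta
loss *= np.exp(rng.normal(0.0, 0.01, size=loss.shape))  # small multiplicative noise

# Taking logs gives a linear model: log L = log A - alpha*log w - beta*log d,
# so the exponents can be recovered with ordinary least squares.
X = np.column_stack([np.ones_like(w), -np.log(w), -np.log(d)])
coef, *_ = np.linalg.lstsq(X, np.log(loss), rcond=None)
A_hat, alpha_hat, beta_hat = np.exp(coef[0]), coef[1], coef[2]
print(f"A≈{A_hat:.2f}, alpha≈{alpha_hat:.3f}, beta≈{beta_hat:.3f}")
```

A fit like this is only as good as the grid of (width, depth) configurations behind it, which is why a suite of checkpoints spanning many architectural shapes, as the Gemstones release provides, matters for this kind of study.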