Distillation Scaling Laws
February 12, 2025
Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
cs.AI
Abstract
We provide a distillation scaling law that estimates distilled model
performance based on a compute budget and its allocation between the student
and teacher. Our findings reduce the risks associated with using distillation
at scale; compute allocation for both the teacher and student models can now be
done to maximize student performance. We provide compute optimal distillation
recipes for when 1) a teacher exists, or 2) a teacher needs training. If many
students are to be distilled, or a teacher already exists, distillation
outperforms supervised pretraining until a compute level which grows
predictably with student size. If one student is to be distilled and a teacher
also needs training, supervised learning should be done instead. Additionally,
we provide insights across our large scale study of distillation, which
increase our understanding of distillation and inform experimental design.
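As a rough illustration of the compute-allocation question the abstract raises, the sketch below grid-searches the fraction of a fixed FLOP budget spent on pretraining a teacher versus distilling a student (the "teacher needs training" scenario). Everything in it is a placeholder assumption for illustration only: the Chinchilla-style supervised loss, the invented `distilled_student_loss` form, the constants, and the `6·N·D` FLOP approximation are not the law or coefficients fitted in the paper.

```python
"""Illustrative only: splitting a fixed compute budget between teacher
pretraining and student distillation under assumed (placeholder) scaling laws.
The functional forms and constants are NOT those fitted in the paper."""

import numpy as np

FLOPS_PER_PARAM_TOKEN = 6.0  # rough transformer training cost: ~6 * N * D FLOPs


def supervised_loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28):
    """Chinchilla-style supervised pretraining loss (illustrative constants)."""
    return E + A / n_params**alpha + B / n_tokens**beta


def distilled_student_loss(n_student, d_student, teacher_loss,
                           k=0.5, c=2.0, gamma=0.3):
    """PLACEHOLDER distillation law: the student is floored by its own
    supervised capacity, penalized when the teacher is weaker than that
    floor, and improves with distillation tokens. Invented for this sketch."""
    floor = supervised_loss(n_student, d_student)
    return floor + k * max(teacher_loss - floor, 0.0) + c / d_student**gamma


def best_teacher_fraction(total_flops, n_teacher, n_student):
    """Grid-search the share of compute spent on teacher pretraining,
    minimizing the (placeholder) predicted student loss."""
    best_frac, best_loss = None, float("inf")
    for frac in np.linspace(0.05, 0.95, 91):
        d_teacher = frac * total_flops / (FLOPS_PER_PARAM_TOKEN * n_teacher)
        d_student = (1.0 - frac) * total_flops / (FLOPS_PER_PARAM_TOKEN * n_student)
        teacher_loss = supervised_loss(n_teacher, d_teacher)
        student_loss = distilled_student_loss(n_student, d_student, teacher_loss)
        if student_loss < best_loss:
            best_frac, best_loss = frac, student_loss
    return best_frac, best_loss


if __name__ == "__main__":
    # Hypothetical setup: 1e21 FLOPs total, 3B-parameter teacher, 1B-parameter student.
    frac, loss = best_teacher_fraction(total_flops=1e21, n_teacher=3e9, n_student=1e9)
    print(f"teacher compute fraction ~ {frac:.2f}, predicted student loss ~ {loss:.3f}")
```

With the paper's actual fitted law in place of the placeholder, the same kind of search would also cover the "teacher already exists" case (all compute spent on distillation) and the comparison against spending the full budget on supervised pretraining of the student.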