

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

December 5, 2024
Authors: Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi
cs.AI

Abstract

We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance. We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models only costs 1% of the compute used for the target models. On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error. We have higher prediction error on four other tasks (average absolute error 6.9) and find that these are often tasks with higher variance in task metrics. We also find that using less compute to train fewer ladder models tends to deteriorate predictions. Finally, we empirically show that our design choices and the two-step approach lead to superior performance in establishing scaling laws.
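To make the two-step approach concrete, below is a minimal fitting sketch in Python. It is not the paper's implementation: the power-law form used for step 1, the sigmoidal form used for step 2, and all ladder-model numbers are illustrative assumptions; only the overall chaining (model size and tokens → task loss → task accuracy) follows the abstract.

```python
# Illustrative sketch of the two-step prediction (hypothetical functional
# forms and made-up ladder data; not the authors' exact implementation).
import numpy as np
from scipy.optimize import curve_fit

# Step 1: task-specific loss as a power law in model size N (parameters)
# and data size D (training tokens).
def task_loss(ND, A, alpha, B, beta, E):
    N, D = ND
    return A / N**alpha + B / D**beta + E

# Step 2: task accuracy as a sigmoidal function of the task loss.
def task_accuracy(L, a, b, L0, k):
    return a / (1.0 + np.exp(k * (L - L0))) + b

# Made-up "ladder" measurements from small models (sizes, token counts,
# and metrics are placeholders, not numbers from the paper).
N = np.array([190e6, 190e6, 370e6, 760e6, 1.3e9, 1.3e9])   # parameters
D = np.array([3.8e9, 7.6e9, 7.4e9, 15.2e9, 26e9, 52e9])    # training tokens
loss = np.array([1.52, 1.46, 1.36, 1.22, 1.12, 1.05])       # task loss
acc = np.array([0.32, 0.35, 0.40, 0.47, 0.53, 0.58])        # task accuracy

# Fit step 1 (N, D -> task loss), then step 2 (task loss -> accuracy).
p1, _ = curve_fit(task_loss, (N, D), loss,
                  p0=[400, 0.3, 400, 0.3, 0.8],
                  bounds=(0, np.inf), maxfev=20000)
p2, _ = curve_fit(task_accuracy, loss, acc,
                  p0=[0.7, 0.25, 1.25, 8.0], maxfev=20000)

# Chain the two fits to predict a large target model, e.g. 7B params / 4T tokens.
pred_loss = task_loss((7e9, 4e12), *p1)
pred_acc = task_accuracy(pred_loss, *p2)
print(f"predicted task loss {pred_loss:.3f}, predicted accuracy {pred_acc:.3f}")
```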
