Loss-to-Loss Prediction: Scaling Laws for All Datasets

November 19, 2024
Authors: David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade
cs.AI

Abstract

While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
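The shifted power law relationships described in the abstract can be fit directly from paired loss measurements. Below is a minimal sketch, assuming a four-parameter form L_B ≈ K · (L_A − E_A)^κ + E_B (the exact parameterization in the paper may differ); the loss values and parameter names are hypothetical, for illustration only.

```python
# Minimal sketch (not the authors' code): fit a shifted power law between
# paired losses L_A and L_B from compute-matched models on datasets A and B.
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(L_A, K, kappa, E_A, E_B):
    # E_A and E_B act as irreducible-loss offsets for each dataset;
    # the clip keeps the base positive during optimization.
    return K * np.clip(L_A - E_A, 1e-9, None) ** kappa + E_B

# Hypothetical paired losses (train-to-train pairing by training compute).
L_A = np.array([3.90, 3.50, 3.20, 3.00, 2.85, 2.75])
L_B = np.array([3.40, 3.10, 2.90, 2.75, 2.65, 2.58])

params, _ = curve_fit(shifted_power_law, L_A, L_B,
                      p0=[1.0, 1.0, 2.0, 2.0], maxfev=10000)

# Extrapolate: predict L_B for a larger-compute model with a lower L_A.
print(shifted_power_law(2.5, *params))
```

The same fitting routine applies to the train-to-test and test-to-test pairings; only the choice of which two losses are paired changes.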
