Loss-to-Loss Prediction: Scaling Laws for All Datasets
November 19, 2024
Authors: David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade
cs.AI
Abstract
While scaling laws provide a reliable methodology for predicting train loss
across compute scales for a single data distribution, less is known about how
these predictions should change as we change the distribution. In this paper,
we derive a strategy for predicting one loss from another and apply it to
predict across different pre-training datasets and from pre-training data to
downstream task data. Our predictions extrapolate well even at 20x the largest
FLOP budget used to fit the curves. More precisely, we find that there are
simple shifted power law relationships between (1) the train losses of two
models trained on two separate datasets when the models are paired by training
compute (train-to-train), (2) the train loss and the test loss on any
downstream distribution for a single model (train-to-test), and (3) the test
losses of two models trained on two separate train datasets (test-to-test). The
results hold up for pre-training datasets that differ substantially (some are
entirely code and others have no code at all) and across a variety of
downstream tasks. Finally, we find that in some settings these shifted power
law relationships can yield more accurate predictions than extrapolating
single-dataset scaling laws.
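To make the "shifted power law" relationship concrete, the sketch below fits and applies a relation of the form L2 = K * (L1 - E1)^kappa + E2, which matches the functional form the abstract describes (a power law between losses after subtracting offsets, e.g. irreducible-loss terms). This is an illustrative assumption, not the paper's fitting code: the function names, the synthetic data, and all parameter values (`true_k`, `true_kappa`, `e1`, `e2`) are made up for demonstration.

```python
import math

def fit_shifted_power_law(l1, l2, e1, e2):
    """Fit K and kappa in L2 - e2 = K * (L1 - e1)**kappa, with the
    offsets e1, e2 assumed known, via least squares in log-log space."""
    xs = [math.log(a - e1) for a in l1]
    ys = [math.log(b - e2) for b in l2]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    kappa = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    log_k = my - kappa * mx
    return math.exp(log_k), kappa

def predict(l1, k, kappa, e1, e2):
    """Predict the second loss from the first via the shifted power law."""
    return k * (l1 - e1) ** kappa + e2

# Synthetic check: generate (L1, L2) pairs from known (hypothetical)
# parameters, then recover them from the paired losses.
true_k, true_kappa, e1, e2 = 0.8, 1.3, 1.7, 1.9
l1 = [2.0, 2.5, 3.0, 3.5]
l2 = [predict(x, true_k, true_kappa, e1, e2) for x in l1]
k_hat, kappa_hat = fit_shifted_power_law(l1, l2, e1, e2)
```

In the train-to-train setting, each pair (L1, L2) would come from two models trained on different datasets but matched by training compute; the fitted curve then lets you predict one dataset's loss from the other's at larger compute budgets.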