Loss-to-Loss Prediction: Scaling Laws for All Datasets

Abstract

While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
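
As a rough illustration of what fitting a shifted power law between two losses can look like, here is a minimal sketch. The abstract does not state the functional form, so the form y = K * (x - e_x)**kappa + e_y, the treatment of the offsets e_x and e_y as known constants, and all names and numbers below are assumptions made for this example, not the authors' method. The data is synthetic and noise-free.

```python
import math


def fit_shifted_power_law(xs, ys, e_x, e_y):
    """Fit y = K * (x - e_x)**kappa + e_y by least squares in log-log space.

    Assumes the offsets e_x and e_y are known and that every x > e_x and
    y > e_y, so the logarithms are defined.
    """
    u = [math.log(x - e_x) for x in xs]
    v = [math.log(y - e_y) for y in ys]
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var = sum((a - mean_u) ** 2 for a in u)
    kappa = cov / var                        # slope of the log-log line
    K = math.exp(mean_v - kappa * mean_u)    # intercept, mapped back from log space
    return K, kappa


def predict(x, K, kappa, e_x, e_y):
    """Evaluate the assumed shifted power law at loss value x."""
    return K * (x - e_x) ** kappa + e_y


# Hypothetical parameters and synthetic losses, for illustration only.
TRUE_K, TRUE_KAPPA, E_X, E_Y = 1.3, 0.9, 1.8, 0.6
losses_a = [3.5, 3.2, 2.9, 2.7, 2.5]  # losses on dataset A at matched compute
losses_b = [predict(x, TRUE_K, TRUE_KAPPA, E_X, E_Y) for x in losses_a]

K_hat, kappa_hat = fit_shifted_power_law(losses_a, losses_b, E_X, E_Y)

# Extrapolate to a lower loss on A than any point used in the fit.
extrapolated = predict(2.2, K_hat, kappa_hat, E_X, E_Y)
```

With the offsets fixed, the fit reduces to ordinary linear regression on log-transformed losses. In practice the offsets would likely have to be estimated as well, for instance from single-dataset scaling-law fits, which this sketch does not attempt.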
