DataDecide: How to Predict Best Pretraining Data with Small Experiments
April 15, 2025
Authors: Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
cs.AI
Abstract
Because large language models are expensive to pretrain on different
datasets, using smaller-scale experiments to decide on data is crucial for
reducing costs. Which benchmarks and methods of making decisions from observed
performance at small scale most accurately predict the datasets that yield the
best large models? To empower open exploration of this question, we release
models, data, and evaluations in DataDecide -- the most extensive open suite of
models over differences in data and scale. We conduct controlled pretraining
experiments across 25 corpora with differing sources, deduplication, and
filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random
seeds. We find that the ranking of models at a single, small size (e.g., 150M
parameters) is a strong baseline for predicting best models at our larger
target scale (1B) (~80% of comparisons correct). No scaling law methods among
8 baselines exceed the compute-decision frontier of single-scale predictions,
but DataDecide can measure improvement in future scaling laws. We also identify
that using continuous likelihood metrics as proxies in small experiments makes
benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable
at the target 1B scale with just 0.01% of the compute.
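The single-scale baseline described in the abstract amounts to ranking pretraining corpora by a benchmark score measured on small models, then checking how often that ranking agrees, pair by pair, with the ranking obtained at the larger target scale. The sketch below illustrates that pairwise decision-accuracy computation; the corpus names, scores, and the decision_accuracy helper are hypothetical and for illustration only, not part of the DataDecide release.

```python
# Minimal sketch of the single-scale ranking baseline: rank corpora by a proxy
# metric measured on a small model, then measure the fraction of pairwise
# comparisons that agree with the ordering at the larger target scale.
# All scores below are hypothetical.
from itertools import combinations

# Hypothetical benchmark scores (higher is better) per pretraining corpus,
# at a small proxy scale (e.g., 150M parameters) and the target scale (1B).
small_scale_scores = {"corpus_a": 0.41, "corpus_b": 0.38, "corpus_c": 0.45, "corpus_d": 0.36}
target_scale_scores = {"corpus_a": 0.52, "corpus_b": 0.49, "corpus_c": 0.57, "corpus_d": 0.50}


def decision_accuracy(proxy: dict, target: dict) -> float:
    """Fraction of corpus pairs whose ordering under the proxy metric
    matches their ordering under the target-scale metric."""
    pairs = list(combinations(proxy.keys(), 2))
    correct = sum(
        1
        for a, b in pairs
        if (proxy[a] - proxy[b]) * (target[a] - target[b]) > 0  # same sign = same ordering
    )
    return correct / len(pairs)


print(f"decision accuracy: {decision_accuracy(small_scale_scores, target_scale_scores):.0%}")
```

With these toy numbers, 5 of 6 pairwise comparisons agree (~83%), in the same spirit as the ~80% agreement the abstract reports for 150M-parameter proxies predicting 1B-parameter outcomes.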