DataDecide: How to Predict Best Pretraining Data with Small Experiments
April 15, 2025
Authors: Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
cs.AI
Abstract
Because large language models are expensive to pretrain on different
datasets, using smaller-scale experiments to decide on data is crucial for
reducing costs. Which benchmarks and methods of making decisions from observed
performance at small scale most accurately predict the datasets that yield the
best large models? To empower open exploration of this question, we release
models, data, and evaluations in DataDecide -- the most extensive open suite of
models over differences in data and scale. We conduct controlled pretraining
experiments across 25 corpora with differing sources, deduplication, and
filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random
seeds. We find that the ranking of models at a single, small size (e.g., 150M
parameters) is a strong baseline for predicting best models at our larger
target scale (1B) (~80% of comparisons correct). No scaling law methods among
8 baselines exceed the compute-decision frontier of single-scale predictions,
but DataDecide can measure improvement in future scaling laws. We also identify
that using continuous likelihood metrics as proxies in small experiments makes
benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable
at the target 1B scale with just 0.01% of the compute.
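
To make the "~80% of comparisons correct" figure concrete, here is a minimal sketch of one way to score small-scale predictions: take every pair of corpora and check whether the corpus that wins at the small scale also wins at the target scale. The function name `decision_accuracy`, the corpus labels, and all scores are hypothetical and chosen only for illustration; whether the paper computes its metric in exactly this way is an assumption.

```python
from itertools import combinations


def decision_accuracy(small_scores, target_scores):
    """Fraction of corpus pairs whose small-scale winner is also the target-scale winner.

    Both arguments map a corpus name to a benchmark score (higher is better).
    Hypothetical helper for illustration, not code from the paper.
    """
    corpora = sorted(set(small_scores) & set(target_scores))
    agree = 0
    total = 0
    for a, b in combinations(corpora, 2):
        small_diff = small_scores[a] - small_scores[b]
        target_diff = target_scores[a] - target_scores[b]
        if target_diff == 0:
            continue  # no winner at target scale, so the pair is skipped
        total += 1
        if small_diff * target_diff > 0:
            agree += 1  # same corpus wins at both scales
    if total == 0:
        raise ValueError("need at least one pair with a target-scale winner")
    return agree / total


# Made-up scores for four corpora; these are NOT results from the paper.
small = {"corpus_a": 0.31, "corpus_b": 0.35, "corpus_c": 0.33, "corpus_d": 0.30}
target = {"corpus_a": 0.42, "corpus_b": 0.50, "corpus_c": 0.41, "corpus_d": 0.38}

acc = decision_accuracy(small, target)
print(f"decision accuracy: {acc:.3f}")
```

In this toy data the small-scale ranking gets five of the six pairs right; only the pair (corpus_a, corpus_c) flips between scales. The same function works unchanged if the scores are a continuous likelihood-based metric instead of accuracy, as long as higher still means better.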