DataDecide: How to Predict Best Pretraining Data with Small Experiments

Abstract

Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide, the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting the best models at our larger target scale (1B), with ~80% of comparisons correct. No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
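The "comparisons correct" figure can be read as a pairwise agreement rate between the ranking of corpora at the small scale and at the target scale. The sketch below illustrates that reading only; the abstract does not give the exact definition, so the function name `decision_accuracy`, the tie handling, and the corpus names and scores are all hypothetical.

```python
from itertools import combinations


def decision_accuracy(small_scores, target_scores):
    """Fraction of corpus pairs ordered the same way at both scales.

    Assumed reading of "comparisons correct"; not taken from the paper.
    Both arguments map a corpus name to a benchmark score (higher is better).
    """
    names = sorted(set(small_scores) & set(target_scores))
    agree = 0
    total = 0
    for a, b in combinations(names, 2):
        small_diff = small_scores[a] - small_scores[b]
        target_diff = target_scores[a] - target_scores[b]
        if target_diff == 0:
            continue  # no winner at the target scale, so nothing to predict
        total += 1
        if small_diff * target_diff > 0:
            agree += 1  # the small-scale winner is also the target-scale winner
    if total == 0:
        raise ValueError("no comparable pairs")
    return agree / total


# Hypothetical scores for four corpora (illustrative values, not real results).
small = {"corpus_a": 0.30, "corpus_b": 0.35, "corpus_c": 0.32, "corpus_d": 0.28}
target = {"corpus_a": 0.45, "corpus_b": 0.50, "corpus_c": 0.44, "corpus_d": 0.40}

print(f"{decision_accuracy(small, target):.3f}")
```

In this toy data only the pair (corpus_a, corpus_c) flips order between scales, so five of the six comparisons agree.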
