Predictive Data Selection: The Data That Predicts Is the Data That Teaches

March 2, 2025
Authors: Kashun Shum, Yuzhen Huang, Hongjian Zou, Ding Qi, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He
cs.AI

Abstract

Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that the compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance when the text domain aligns with the downstream benchmark (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning. To leverage this insight, we introduce data selection based on data's Predictive strength (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of a vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu, at the scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect.
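
The abstract states that selection only requires training and deploying a fastText-based scorer. The sketch below is a minimal illustration of that deployment step, not the authors' released scorer: it assumes a hypothetical training file `preselect_train.txt` whose `__label__keep` / `__label__drop` examples were derived from documents ranked by predictive strength, trains a small fastText classifier on it, and keeps the top-scoring fraction of a corpus.

```python
# Minimal PreSelect-style selection sketch (illustrative, not the released scorer).
# Assumption: `preselect_train.txt` contains fastText-format lines such as
#   __label__keep <text of a document with high predictive strength>
#   __label__drop <text of a document with low predictive strength>
import fasttext


def train_scorer(train_path: str = "preselect_train.txt"):
    # Train a small supervised fastText classifier; hyperparameters are illustrative.
    return fasttext.train_supervised(
        input=train_path, lr=0.1, epoch=5, wordNgrams=2, dim=100
    )


def keep_probability(model, document: str) -> float:
    # fastText rejects newlines at prediction time, so flatten the document first.
    text = document.replace("\n", " ")
    labels, probs = model.predict(text, k=2)
    return dict(zip(labels, probs)).get("__label__keep", 0.0)


def select_documents(model, documents, keep_fraction: float = 0.1):
    # Rank the corpus by scorer probability and keep the top fraction,
    # i.e., threshold-based selection of pretraining data.
    ranked = sorted(documents, key=lambda d: keep_probability(model, d), reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]


if __name__ == "__main__":
    scorer = train_scorer()
    corpus = ["example web document one ...", "example web document two ..."]
    print(f"kept {len(select_documents(scorer, corpus))} of {len(corpus)} documents")
```

In the paper's setting the scorer is applied at web-corpus scale, so scoring would in practice be sharded and parallelized; the single-process loop here only keeps the example self-contained.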
