CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

April 17, 2025
Authors: Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
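The abstract describes CLIMB's search loop only at a high level. The sketch below renders that loop under stated assumptions, and is an illustration of the iterative bootstrapping idea rather than the authors' implementation: documents are grouped with k-means in an embedding space, candidate mixtures are Dirichlet samples over cluster weights, a Ridge regressor stands in for the learned predictor, and `eval_proxy` is a hypothetical callback that trains a small proxy model on data sampled according to a mixture and returns a validation score. None of these function names or hyperparameters come from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge


def climb_search(embeddings, eval_proxy, n_clusters=20, n_rounds=3,
                 n_candidates=32, n_evals=4, seed=0):
    """Hypothetical sketch of a CLIMB-style search: cluster the corpus in a
    semantic embedding space, then alternate between proposing mixture
    weights over clusters, ranking proposals with a cheap learned predictor,
    and verifying the top proposals with small proxy-model runs."""
    rng = np.random.default_rng(seed)

    # Step 1: partition documents into semantic clusters.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embeddings)

    mixtures, scores = [], []  # history of (mixture, proxy score) pairs
    for _ in range(n_rounds):
        # Step 2: propose candidate mixtures (points on the simplex).
        candidates = rng.dirichlet(np.ones(n_clusters), size=n_candidates)

        if mixtures:
            # Step 3: fit a predictor on past proxy results and use it to
            # rank new candidates, so only the most promising mixtures
            # receive an expensive proxy-training run.
            predictor = Ridge(alpha=1.0).fit(np.array(mixtures),
                                             np.array(scores))
            order = np.argsort(-predictor.predict(candidates))
            candidates = candidates[order]

        # Step 4: spend the proxy budget on the top-ranked candidates.
        for mix in candidates[:n_evals]:
            mixtures.append(mix)
            scores.append(eval_proxy(mix, labels))  # train small model, eval

    best = int(np.argmax(scores))
    return mixtures[best], scores[best]
```

In this framing, each round spends the costly proxy-training budget only on the candidates the cheap predictor ranks highest, which is what makes an iterative search over mixtures tractable at pre-training scale.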
