

CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

April 17, 2025
Authors: Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
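The abstract describes the CLIMB pipeline only at a high level: embed and cluster the corpus in a semantic space, then iteratively search for a good cluster mixture using a small proxy model and a learned predictor. The toy sketch below illustrates that general pattern; it is not the authors' implementation. The random "embeddings", the simulated proxy score, and the gradient-boosted predictor are all illustrative stand-ins chosen for brevity.

```python
# Illustrative CLIMB-style loop: cluster the corpus, then alternate between
# evaluating candidate mixtures with a cheap proxy and fitting a predictor
# on (mixture -> score) pairs to pre-screen the next round of candidates.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_clusters = 20

# 1) Embed the corpus (random vectors stand in for a real text encoder).
doc_embeddings = rng.normal(size=(10_000, 64))

# 2) Cluster in semantic space; each cluster acts as a discovered "domain".
cluster_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(doc_embeddings)

# Hidden "optimal" mixture, used only to simulate proxy-model evaluations here.
_true_optimum = rng.dirichlet(np.ones(n_clusters))

def proxy_score(mixture: np.ndarray) -> float:
    """Stand-in for training a small proxy model on data sampled with
    `mixture` and returning its downstream score (higher is better)."""
    return float(-np.abs(mixture - _true_optimum).sum())

def sample_mixtures(n: int) -> np.ndarray:
    """Draw candidate mixtures: per-cluster sampling weights summing to 1."""
    return rng.dirichlet(np.ones(n_clusters), size=n)

# 3) Iterative bootstrapping: score some mixtures with the proxy, fit a
#    predictor, then use it to pre-screen candidates in later iterations.
history_x, history_y = [], []
best_mixture, best_score = None, -np.inf
for iteration in range(3):
    candidates = sample_mixtures(500)
    if history_x:
        predictor = GradientBoostingRegressor().fit(np.array(history_x), np.array(history_y))
        keep = np.argsort(predictor.predict(candidates))[-20:]  # most promising
        candidates = candidates[keep]
    else:
        candidates = candidates[:20]
    for mixture in candidates:
        score = proxy_score(mixture)  # in practice: train and evaluate a proxy model
        history_x.append(mixture)
        history_y.append(score)
        if score > best_score:
            best_mixture, best_score = mixture, score

# The final mixture would then set how many tokens are drawn from each cluster.
print("Best per-cluster weights:", np.round(best_mixture, 3))
```

In this sketch the predictor plays the role of a cheap surrogate: it filters hundreds of candidate mixtures down to the handful worth spending proxy-model training on, which is what makes an iterative search affordable at pre-training scale.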
