

Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

September 25, 2024
Authors: Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu
cs.AI

Abstract

Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, 14.6% over Llama-2-7B, and 20.3% over CodeLlama-7B, all within 10B tokens, making them comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX together with a corpus of more than 100B tokens and the trained models, and sharing all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
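The abstract describes the mechanism only at a high level: a small refining model emits a short per-example "program" of fine-grained operations (e.g., string normalization or dropping a document), and an executor applies it. The sketch below is a minimal illustration of that "refinement as programming" idea; the operation names (normalize_text, remove_lines, drop_doc), the stubbed program generator, and the dispatch logic are assumptions for illustration, not ProX's actual interface. See the linked repository for the real implementation.

```python
# Illustrative sketch of "data refinement as programming" (not the ProX codebase).
import re

def normalize_text(doc: str) -> str:
    """String normalization: collapse repeated whitespace and strip the ends."""
    return re.sub(r"\s+", " ", doc).strip()

def remove_lines(doc: str, pattern: str) -> str:
    """Drop lines matching a regex (e.g., navigation or boilerplate lines)."""
    return "\n".join(l for l in doc.splitlines() if not re.search(pattern, l))

def drop_doc(doc: str) -> str:
    """Discard the whole example."""
    return ""

def generate_program(doc: str) -> list[tuple[str, dict]]:
    """Stand-in for the small refining model: returns a per-example 'program'.

    In ProX this step would be produced by a ~0.3B-parameter language model;
    here a toy heuristic plays that role so the executor below is runnable.
    """
    ops: list[tuple[str, dict]] = [("normalize_text", {})]
    if "lorem ipsum" in doc.lower():
        ops.append(("drop_doc", {}))
    return ops

# Registry mapping operation names in the generated program to executable functions.
OPS = {
    "normalize_text": normalize_text,
    "remove_lines": remove_lines,
    "drop_doc": drop_doc,
}

def refine_example(doc: str) -> str:
    """Execute the generated fine-grained operations on a single example."""
    for name, kwargs in generate_program(doc):
        doc = OPS[name](doc, **kwargs)
        if not doc:  # the document was dropped
            break
    return doc

if __name__ == "__main__":
    raw = "  Breaking   news:   lorem ipsum placeholder text.  "
    print(repr(refine_example(raw)))  # '' -> this example is dropped
```

In this toy setup the "program" is just a list of operation names with arguments, executed per document; the paper's point is that generating such fine-grained, example-specific operations scales where hand-written corpus-level rules do not.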
