

MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

February 6, 2025
Authors: Xintong Hao, Ke Shen, Chenggang Li
cs.AI

Abstract

Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, natural language data struggles to scale at the same pace. To tackle this bottleneck, we propose the MAssive Genre-Audience (MAGA) reformulation method, which systematically synthesizes diverse, contextually rich pretraining data from existing corpora. This work makes three main contributions: (1) We propose the MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B-token MAGACorpus. (2) We evaluate MAGACorpus with different data budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B) and establishing the necessity of next-generation, large-scale synthetic pretraining for language models. (3) Through comprehensive analysis, we investigate prompt engineering's impact on synthetic training collapse and reveal the limitations of conventional collapse-detection metrics based on validation loss. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.
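
The abstract describes reformulating existing documents into multiple genre-audience variants to expand a pretraining corpus. As a rough illustration only, the Python sketch below shows one way such an expansion loop could be structured; the genre and audience lists, the `generate` text-generation callable, and the prompt wording are all hypothetical placeholders, not the paper's actual taxonomy, prompts, or implementation.

```python
# Minimal sketch of a genre-audience reformulation loop (illustrative only).
# `generate(prompt) -> str` is a hypothetical text-generation callable; the
# paper's real prompts, genre/audience taxonomy, and pairing strategy differ.
import itertools
import random

# Hypothetical genre and audience pools; the actual MAGA taxonomy is larger.
GENRES = ["textbook chapter", "news article", "dialogue", "tutorial"]
AUDIENCES = ["children", "domain experts", "general readers", "students"]


def reformulate(document: str, generate, pairs_per_doc: int = 2) -> list[str]:
    """Rewrite one source document into several (genre, audience) variants."""
    pairs = random.sample(list(itertools.product(GENRES, AUDIENCES)), pairs_per_doc)
    variants = []
    for genre, audience in pairs:
        prompt = (
            f"Rewrite the following text as a {genre} aimed at {audience}, "
            f"preserving the factual content:\n\n{document}"
        )
        variants.append(generate(prompt))
    return variants
```

Applying such a loop to every document in a source corpus, with several genre-audience pairs per document, is one plausible way a corpus could be expanded by several multiples while reusing the same underlying factual content.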
