How to Synthesize Text Data without Model Collapse?
December 19, 2024
Authors: Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou
cs.AI
Abstract
Model collapse in synthetic data indicates that iterative training on
self-generated data leads to a gradual decline in performance. With the
proliferation of AI models, synthetic data will fundamentally reshape the web
data ecosystem. Future GPT-{n} models will inevitably be trained on a blend
of synthetic and human-produced data. In this paper, we focus on two questions:
what is the impact of synthetic data on language model training, and how to
synthesize data without model collapse? We first pre-train language models
across different proportions of synthetic data, revealing a negative
correlation between the proportion of synthetic data and model performance. We
further conduct statistical analysis on synthetic data to uncover a
distributional shift phenomenon and an over-concentration of n-gram features.
Inspired by the above findings, we propose token editing on human-produced data
to obtain semi-synthetic data. As a proof of concept, we theoretically
demonstrate that token-level editing can prevent model collapse, as the test
error is constrained by a finite upper bound. We conduct extensive experiments
on pre-training from scratch, continual pre-training, and supervised
fine-tuning. The results validate our theoretical proof that token-level
editing improves data quality and enhances model performance.
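To make the first experiment concrete: the pre-training study varies the fraction of synthetic documents in the training mixture. Below is a minimal Python sketch of how such mixtures can be constructed; the function name, the toy document lists, and the specific ratios are illustrative assumptions, not the paper's released code.

```python
import random

def mix_corpus(human_docs, synthetic_docs, synth_ratio, size, seed=0):
    """Assemble a training set in which a fraction `synth_ratio` of the
    documents is synthetic and the remainder is human-produced."""
    rng = random.Random(seed)
    n_synth = round(size * synth_ratio)
    mixture = (rng.choices(synthetic_docs, k=n_synth) +
               rng.choices(human_docs, k=size - n_synth))
    rng.shuffle(mixture)
    return mixture

# Toy stand-ins for the two corpora (placeholders, not real data).
human_docs = ["a human-written document", "another human-written document"]
synthetic_docs = ["a model-generated document", "another model-generated document"]

# One mixture per ratio; in the paper's setup, a language model is
# pre-trained from scratch on each mixture and evaluated, revealing the
# negative correlation between synthetic proportion and performance.
for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    train_set = mix_corpus(human_docs, synthetic_docs, ratio, size=8)
    n_synth = sum("model-generated" in d for d in train_set)
    print(f"ratio={ratio}: {n_synth}/8 synthetic documents")
```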
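The reported n-gram over-concentration can likewise be illustrated with a simple coverage statistic. A minimal sketch, assuming whitespace tokenization and a top-k coverage measure (both are simplifications; the paper's exact features and statistics may differ):

```python
from collections import Counter

def ngram_concentration(texts, n=2, top_k=100):
    """Fraction of all n-gram occurrences accounted for by the top_k most
    frequent n-grams; a higher value means a more concentrated distribution."""
    counts = Counter()
    for text in texts:
        tokens = text.split()  # whitespace tokenization, for illustration only
        counts.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(top_k)) / total if total else 0.0

# Toy comparison: a markedly higher score on the synthetic side would
# reflect the over-concentration of n-gram features the abstract reports.
human = ["the cat sat on the mat", "dogs bark at the moon at night"]
synthetic = ["the model said the model said", "the model said it once again"]
print(ngram_concentration(human, top_k=3), ngram_concentration(synthetic, top_k=3))
```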
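The proposed token-level editing can be sketched as follows, under one plausible reading of the abstract: a prior language model scores every token of a human-written text, and tokens the model already predicts with probability above a threshold `p` are resampled, producing semi-synthetic data that stays anchored to the human distribution. The choice of `gpt2` as the prior model, the threshold value, and the resampling rule are all assumptions for illustration, not the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder prior model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def token_edit(text, p=0.99):
    """Resample tokens whose conditional probability under the prior model
    exceeds p; every other (human-written) token is kept unchanged."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    probs = lm(ids.unsqueeze(0)).logits[0].softmax(-1)   # [seq_len, vocab]
    out = ids.clone()
    for i in range(1, len(ids)):                         # position 0 has no context
        next_dist = probs[i - 1]                         # prediction for token i
        if next_dist[ids[i]] > p:                        # model is over-confident here
            out[i] = torch.multinomial(next_dist, 1).item()
    return tok.decode(out)

print(token_edit("The quick brown fox jumps over the lazy dog."))
```

Because only high-confidence tokens are touched, most of the human text survives verbatim, which is what distinguishes this semi-synthetic data from fully model-generated text.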
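Finally, the theoretical claim can be stated schematically. In the model-collapse setting, iteratively retraining on fully self-generated data lets the test error accumulate across generations, whereas the abstract's result for token-level editing is a generation-independent bound. The LaTeX below is an illustrative contrast only, not the paper's exact theorem or constants:

```latex
% E^{(n)}_{\mathrm{test}}: test error after n generations of self-training.
% Collapse regime (each generation trained purely on the previous
% generation's outputs): the error is unbounded in n.
\[
  \mathbb{E}\!\left[E^{(n)}_{\mathrm{test}}\right] \;\nearrow\; \infty
  \quad \text{as } n \to \infty .
\]
% Token-level editing of human data: the abstract states that the test
% error admits a finite upper bound C that does not grow with n.
\[
  \mathbb{E}\!\left[E^{(n)}_{\mathrm{test}}\right] \;\le\; C \;<\; \infty
  \quad \text{for all generations } n .
\]
```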