How to Synthesize Text Data without Model Collapse?
December 19, 2024
Authors: Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou
cs.AI
Abstract
Model collapse in synthetic data indicates that iterative training on
self-generated data leads to a gradual decline in performance. With the
proliferation of AI models, synthetic data will fundamentally reshape the web
data ecosystem. Future GPT-{n} models will inevitably be trained on a blend
of synthetic and human-produced data. In this paper, we focus on two questions:
what is the impact of synthetic data on language model training, and how to
synthesize data without model collapse? We first pre-train language models
across different proportions of synthetic data, revealing a negative
correlation between the proportion of synthetic data and model performance. We
further conduct statistical analysis on synthetic data to uncover a
distributional shift phenomenon and an over-concentration of n-gram features.
Inspired by the above findings, we propose token editing on human-produced data
to obtain semi-synthetic data. As a proof of concept, we theoretically
demonstrate that token-level editing can prevent model collapse, as the test
error is constrained by a finite upper bound. We conduct extensive experiments
on pre-training from scratch, continual pre-training, and supervised
fine-tuning. The results validate our theoretical proof that token-level
editing improves data quality and enhances model performance.
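To make the first experiment concrete: the pre-training study varies the fraction of synthetic documents in the training mixture. Below is a minimal Python sketch of how such mixtures can be constructed; the function name, the toy document lists, and the specific ratios are illustrative assumptions, not the paper's released code.

```python
import random

def mix_corpus(human_docs, synthetic_docs, synth_ratio, size, seed=0):
    """Assemble a training set in which a fraction `synth_ratio` of the
    documents is synthetic and the remainder is human-produced."""
    rng = random.Random(seed)
    n_synth = round(size * synth_ratio)
    mixture = (rng.choices(synthetic_docs, k=n_synth) +
               rng.choices(human_docs, k=size - n_synth))
    rng.shuffle(mixture)
    return mixture

# Toy stand-ins for the two corpora (placeholders, not real data).
human_docs = ["a human-written document", "another human-written document"]
synthetic_docs = ["a model-generated document", "another model-generated document"]

# One mixture per ratio; in the paper's setup, a language model is
# pre-trained from scratch on each mixture and evaluated, revealing the
# negative correlation between synthetic proportion and performance.
for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    train_set = mix_corpus(human_docs, synthetic_docs, ratio, size=8)
    n_synth = sum("model-generated" in d for d in train_set)
    print(f"ratio={ratio}: {n_synth}/8 synthetic documents")
```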
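The reported n-gram over-concentration can likewise be illustrated with a simple coverage statistic. A minimal sketch, assuming whitespace tokenization and a top-k coverage measure (both are simplifications; the paper's exact features and statistics may differ):

```python
from collections import Counter

def ngram_concentration(texts, n=2, top_k=100):
    """Fraction of all n-gram occurrences accounted for by the top_k most
    frequent n-grams; a higher value means a more concentrated distribution."""
    counts = Counter()
    for text in texts:
        tokens = text.split()  # whitespace tokenization, for illustration only
        counts.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(top_k)) / total if total else 0.0

# Toy comparison: a markedly higher score on the synthetic side would
# reflect the over-concentration of n-gram features the abstract reports.
human = ["the cat sat on the mat", "dogs bark at the moon at night"]
synthetic = ["the model said the model said", "the model said it once again"]
print(ngram_concentration(human, top_k=3), ngram_concentration(synthetic, top_k=3))
```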
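The proposed token-level editing can be sketched as follows, under one plausible reading of the abstract: a prior language model scores every token of a human-written text, and tokens the model already predicts with probability above a threshold `p` are resampled, producing semi-synthetic data that stays anchored to the human distribution. The choice of `gpt2` as the prior model, the threshold value, and the resampling rule are all assumptions for illustration, not the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder prior model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def token_edit(text, p=0.99):
    """Resample tokens whose conditional probability under the prior model
    exceeds p; every other (human-written) token is kept unchanged."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    probs = lm(ids.unsqueeze(0)).logits[0].softmax(-1)   # [seq_len, vocab]
    out = ids.clone()
    for i in range(1, len(ids)):                         # position 0 has no context
        next_dist = probs[i - 1]                         # prediction for token i
        if next_dist[ids[i]] > p:                        # model is over-confident here
            out[i] = torch.multinomial(next_dist, 1).item()
    return tok.decode(out)

print(token_edit("The quick brown fox jumps over the lazy dog."))
```

Because only high-confidence tokens are touched, most of the human text survives verbatim, which is what distinguishes this semi-synthetic data from fully model-generated text.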
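Finally, the theoretical claim can be stated schematically. In the model-collapse setting, iteratively retraining on fully self-generated data lets the test error accumulate across generations, whereas the abstract's result for token-level editing is a generation-independent bound. The LaTeX below is an illustrative contrast only, not the paper's exact theorem or constants:

```latex
% E^{(n)}_{\mathrm{test}}: test error after n generations of self-training.
% Collapse regime (each generation trained purely on the previous
% generation's outputs): the error is unbounded in n.
\[
  \mathbb{E}\!\left[E^{(n)}_{\mathrm{test}}\right] \;\nearrow\; \infty
  \quad \text{as } n \to \infty .
\]
% Token-level editing of human data: the abstract states that the test
% error admits a finite upper bound C that does not grow with n.
\[
  \mathbb{E}\!\left[E^{(n)}_{\mathrm{test}}\right] \;\le\; C \;<\; \infty
  \quad \text{for all generations } n .
\]
```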