How to Synthesize Text Data without Model Collapse?
December 19, 2024
Authors: Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou
cs.AI
Abstract
Model collapse in synthetic data indicates that iterative training on
self-generated data leads to a gradual decline in performance. With the
proliferation of AI models, synthetic data will fundamentally reshape the web
data ecosystem. Future GPT-{n} models will inevitably be trained on a blend
of synthetic and human-produced data. In this paper, we focus on two questions:
what is the impact of synthetic data on language model training, and how to
synthesize data without model collapse? We first pre-train language models
across different proportions of synthetic data, revealing a negative
correlation between the proportion of synthetic data and model performance. We
further conduct statistical analysis on synthetic data, uncovering a
distributional shift phenomenon and an over-concentration of n-gram features.
Inspired by these findings, we propose token-level editing on human-produced data
to obtain semi-synthetic data. As a proof of concept, we theoretically
demonstrate that token-level editing can prevent model collapse, as the test
error is constrained by a finite upper bound. We conduct extensive experiments
on pre-training from scratch, continual pre-training, and supervised
fine-tuning. The results validate our theoretical proof that token-level
editing improves data quality and enhances model performance.
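
The pre-training finding above lends itself to a simple experimental scaffold. Below is a minimal sketch of how one might sweep the synthetic-data proportion; `mix_corpus`, the toy document lists, and the chosen fractions are illustrative assumptions, not the paper's exact setup.

```python
import random

def mix_corpus(human_docs, synthetic_docs, synth_fraction, size, seed=0):
    """Assemble a training corpus with the given fraction of synthetic docs."""
    rng = random.Random(seed)
    n_synth = round(synth_fraction * size)
    return (rng.sample(synthetic_docs, n_synth)
            + rng.sample(human_docs, size - n_synth))

# Toy stand-ins for real corpora; the sweep mirrors the paper's setup of
# pre-training at several synthetic-data proportions and comparing scores.
human_docs = [f"human doc {i}" for i in range(100)]
synthetic_docs = [f"synthetic doc {i}" for i in range(100)]
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    corpus = mix_corpus(human_docs, synthetic_docs, frac, size=100)
    # ...pre-train a language model on `corpus` and evaluate it here...
```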
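The n-gram over-concentration finding can be checked with a small diagnostic: measure how much of the total n-gram mass the top-k most frequent n-grams capture. The sketch below is a hypothetical version of such a statistic; the function name, toy corpora, and top-k choice are assumptions, and the paper's own analysis may be computed differently.

```python
from collections import Counter

def ngram_concentration(texts, n=2, top_k=100):
    """Fraction of all n-gram occurrences captured by the top_k most
    frequent n-grams; higher values indicate heavier concentration."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(c for _, c in counts.most_common(top_k))
    return top / total

# The paper's finding predicts the synthetic corpus concentrates
# probability mass on fewer distinct n-grams than the human corpus.
human = ["the cat sat on the mat", "a dog ran across the yard"]
synthetic = ["the cat sat on the mat", "the cat sat on the rug"]
print(ngram_concentration(human, n=2), ngram_concentration(synthetic, n=2))
```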
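To make the token-level editing idea concrete, here is one plausible rendering: a prior language model scores every token of a human-written text, and tokens the model finds near-certain are resampled from its predictive distribution, yielding semi-synthetic data. The model choice (`gpt2`), the threshold, and the resampling policy are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM works for this sketch.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def token_level_edit(text, threshold=0.99):
    """Resample tokens the prior LM deems near-certain, producing
    semi-synthetic text. Threshold and policy are assumptions."""
    ids = tok(text, return_tensors="pt").input_ids
    probs = torch.softmax(model(ids).logits, dim=-1)
    out = ids.clone()
    for i in range(1, ids.shape[1]):
        p = probs[0, i - 1]            # predictive distribution at position i
        if p[ids[0, i]] > threshold:   # token is highly predictable
            out[0, i] = torch.multinomial(p, 1).item()  # resample it
    return tok.decode(out[0])

print(token_level_edit("The quick brown fox jumps over the lazy dog."))
```

Because only high-confidence tokens are touched, most of the human-written signal is preserved, which is what distinguishes this semi-synthetic data from fully model-generated text.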
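Schematically, the theoretical claim can be contrasted with the usual collapse picture: in standard analyses of iterative self-training, test error accumulates across generations n, whereas the paper proves a generation-independent bound under token-level editing. The expressions below are illustrative shapes only, not the paper's exact statement or constants.

```latex
% Illustrative contrast, not the paper's exact theorem:
\[
  \underbrace{E_{\mathrm{test}}^{(n)} \to \infty \ \text{as}\ n \to \infty}_{\text{iterative full synthesis}}
  \qquad \text{vs.} \qquad
  \underbrace{\sup_{n}\, E_{\mathrm{test}}^{(n)} \;\le\; C \;<\; \infty}_{\text{token-level editing}}
\]
```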