How to Synthesize Text Data without Model Collapse?

Abstract

Model collapse in synthetic data refers to the gradual decline in performance caused by iterative training on self-generated data. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem, and future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We first pre-train language models on different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data and uncover a distributional shift and an over-concentration of n-gram features. Inspired by these findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.
