Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

December 4, 2024
作者: Alex Havrilla, Andrew Dai, Laura O'Mahony, Koen Oostermeijer, Vera Zisler, Alon Albalak, Fabrizio Milo, Sharath Chandra Raparthy, Kanishk Gandhi, Baber Abbasi, Duy Phung, Maia Iyer, Dakota Mahan, Chase Blagden, Srishti Gureja, Mohammed Hamdy, Wen-Ding Li, Giovanni Paolini, Pawan Sasanka Ammanamanchi, Elliot Meyerson
cs.AI

Abstract

Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of synthetic data generated by each algorithm in terms of data quality, diversity, and complexity. We choose these three characteristics for their significance in open-ended processes and the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of Quality-Diversity trade-offs in training data and the downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on data QDC composition. This analysis extends into a discussion on the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, often there exist trade-offs between model output quality and output diversity which impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms and highlight a number of works making progress in this direction.
