Evaluating Language Models as Synthetic Data Generators
December 4, 2024
Authors: Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig
cs.AI
Abstract
Given the increasing use of synthetic data in language model (LM)
post-training, an LM's ability to generate high-quality data has become nearly
as crucial as its ability to solve problems directly. While prior works have
focused on developing effective data generation methods, they lack systematic
comparison of different LMs as data generators in a unified setting. To address
this gap, we propose AgoraBench, a benchmark that provides standardized
settings and metrics to evaluate LMs' data generation abilities. Through
synthesizing 1.26 million training instances using 6 LMs and training 99
student models, we uncover key insights about LMs' data generation
capabilities. First, we observe that LMs exhibit distinct strengths. For
instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet
performs better at enhancing existing ones. Furthermore, our analysis reveals
that an LM's data generation ability doesn't necessarily correlate with its
problem-solving ability. Instead, multiple intrinsic features of data
quality, including response quality, perplexity, and instruction
difficulty, collectively serve as better indicators. Finally, we demonstrate
that strategic choices in output format and cost-conscious model selection
significantly impact data generation effectiveness.