Evaluating Language Models as Synthetic Data Generators

December 4, 2024
Authors: Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig
cs.AI

Abstract

Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack a systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability does not necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality (including response quality, perplexity, and instruction difficulty) collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.
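
To make one of the intrinsic signals above concrete, here is a minimal sketch, not the paper's released code, of measuring the perplexity of a synthetic training instance under a reference LM with Hugging Face transformers. The reference model (gpt2) and the instance format are illustrative assumptions.

```python
# Minimal sketch: perplexity of a synthetic training instance under a
# reference LM, one of the intrinsic data-quality signals the abstract
# names. The reference model and instance format are assumptions, not
# the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small open model, chosen only for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Exponentiated mean token cross-entropy of `text` under the reference LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # loss over next-token predictions.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Example: score one hypothetical synthetic instruction-response pair.
instance = "Instruction: Solve 2x + 3 = 11 for x.\nResponse: x = 4."
print(f"perplexity = {perplexity(instance):.2f}")
```

Perplexity alone is not presented as a sufficient signal; per the abstract, it is the combination with response quality and instruction difficulty that better predicts a generator's usefulness.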
