Evaluating Language Models as Synthetic Data Generators
December 4, 2024
Authors: Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig
cs.AI
Abstract
Given the increasing use of synthetic data in language model (LM)
post-training, an LM's ability to generate high-quality data has become nearly
as crucial as its ability to solve problems directly. While prior works have
focused on developing effective data generation methods, they lack systematic
comparison of different LMs as data generators in a unified setting. To address
this gap, we propose AgoraBench, a benchmark that provides standardized
settings and metrics to evaluate LMs' data generation abilities. Through
synthesizing 1.26 million training instances using 6 LMs and training 99
student models, we uncover key insights about LMs' data generation
capabilities. First, we observe that LMs exhibit distinct strengths. For
instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet
performs better at enhancing existing ones. Furthermore, our analysis reveals
that an LM's data generation ability doesn't necessarily correlate with its
problem-solving ability. Instead, multiple intrinsic features of data
quality, including response quality, perplexity, and instruction
difficulty, collectively serve as better indicators. Finally, we demonstrate
that strategic choices in output format and cost-conscious model selection
significantly impact data generation effectiveness.
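
To make the perplexity signal mentioned in the abstract concrete, below is a minimal, illustrative sketch (not the paper's AgoraBench code or metric) of scoring synthetic instruction-response pairs by response perplexity under a small reference LM. The model choice, the helper function, and the toy instances are assumptions added purely for illustration.

```python
# Minimal sketch (illustrative only, not the AgoraBench pipeline):
# scoring synthetic instances by perplexity under a small reference LM.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumed stand-in; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def instance_perplexity(instruction: str, response: str) -> float:
    """Perplexity over the concatenated instruction and response.
    A fuller version would mask the instruction tokens so that only
    the response is scored."""
    text = instruction + "\n" + response
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # token-level cross-entropy loss; exp(loss) is the perplexity.
        out = model(**inputs, labels=inputs["input_ids"])
    return math.exp(out.loss.item())

# Toy synthetic instances (hypothetical examples, not benchmark data)
instances = [
    {"instruction": "Add 17 and 25.", "response": "17 + 25 = 42."},
    {"instruction": "Name a prime number greater than 10.", "response": "13 is prime."},
]
for ex in instances:
    ppl = instance_perplexity(ex["instruction"], ex["response"])
    print(f"{ex['instruction']!r}: perplexity ~ {ppl:.1f}")
```

Per the abstract, intrinsic features of this kind, taken together with response quality and instruction difficulty, are reported to predict a generator's usefulness better than the generator's own problem-solving scores.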