Evaluating Language Models as Synthetic Data Generators
December 4, 2024
Authors: Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig
cs.AI
Abstract
Given the increasing use of synthetic data in language model (LM)
post-training, an LM's ability to generate high-quality data has become nearly
as crucial as its ability to solve problems directly. While prior works have
focused on developing effective data generation methods, they lack systematic
comparison of different LMs as data generators in a unified setting. To address
this gap, we propose AgoraBench, a benchmark that provides standardized
settings and metrics to evaluate LMs' data generation abilities. Through
synthesizing 1.26 million training instances using 6 LMs and training 99
student models, we uncover key insights about LMs' data generation
capabilities. First, we observe that LMs exhibit distinct strengths. For
instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet
performs better at enhancing existing ones. Furthermore, our analysis reveals
that an LM's data generation ability doesn't necessarily correlate with its
problem-solving ability. Instead, multiple intrinsic features of data
quality, including response quality, perplexity, and instruction
difficulty, collectively serve as better indicators. Finally, we demonstrate
that strategic choices in output format and cost-conscious model selection
significantly impact data generation effectiveness.
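
To make the perplexity signal mentioned in the abstract concrete, below is a minimal, illustrative sketch (not the paper's AgoraBench code or metric) of scoring synthetic instruction-response pairs by response perplexity under a small reference LM. The model choice, the helper function, and the toy instances are assumptions added purely for illustration.

```python
# Minimal sketch (illustrative only, not the AgoraBench pipeline):
# scoring synthetic instances by perplexity under a small reference LM.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumed stand-in; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def instance_perplexity(instruction: str, response: str) -> float:
    """Perplexity over the concatenated instruction and response.
    A fuller version would mask the instruction tokens so that only
    the response is scored."""
    text = instruction + "\n" + response
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # token-level cross-entropy loss; exp(loss) is the perplexity.
        out = model(**inputs, labels=inputs["input_ids"])
    return math.exp(out.loss.item())

# Toy synthetic instances (hypothetical examples, not benchmark data)
instances = [
    {"instruction": "Add 17 and 25.", "response": "17 + 25 = 42."},
    {"instruction": "Name a prime number greater than 10.", "response": "13 is prime."},
]
for ex in instances:
    ppl = instance_perplexity(ex["instruction"], ex["response"])
    print(f"{ex['instruction']!r}: perplexity ~ {ppl:.1f}")
```

Per the abstract, intrinsic features of this kind, taken together with response quality and instruction difficulty, are reported to predict a generator's usefulness better than the generator's own problem-solving scores.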