MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation
April 17, 2025
Authors: Haris Riaz, Sourav Bhabesh, Vinayak Arannil, Miguel Ballesteros, Graham Horwood
cs.AI
Abstract
Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data
generated using larger language models. Questions remain about leveraging
synthetic data for other use cases, such as adapting LLMs to specific domains.
A key limitation of synthetic data is low diversity, which negatively impacts
its downstream applicability for improving other models. To address this, we
propose MetaSynth, a method for generating synthetic data that enhances
diversity through meta-prompting, where a language model orchestrates multiple
"expert" LLM agents to collaboratively generate data. Using only 25 million
tokens of synthetic data generated with MetaSynth, we successfully adapt a
well-trained LLM (Mistral-7B-v0.3) to two specialized domains, Finance and
Biomedicine, without compromising the capabilities of the resulting model in
general tasks. In addition, we evaluate the diversity of our synthetic data
using seven automated metrics, and find that it approaches the diversity of LLM
pre-training corpora.
Mistral-7B-v0.3 continually pre-trained on MetaSynth data notably outperforms
the base LLM, with improvements of up to 4.08% in Finance and 13.75% in
Biomedicine. The same model shows degraded performance when trained on data
generated using a template prompt, even when the template includes prior
generations and varying in-context exemplars of real data. Our findings suggest
that a few million tokens of diverse synthetic data, without mixing in any real
data, are sufficient for effective domain adaptation when using MetaSynth.