MetaSynth:基於元提示驅動的代理框架,實現多樣化合成數據生成
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation
April 17, 2025
作者: Haris Riaz, Sourav Bhabesh, Vinayak Arannil, Miguel Ballesteros, Graham Horwood
cs.AI
摘要
近期如Phi-3.5和Phi-4等較小型的語言模型,依賴於利用更大規模語言模型生成的合成數據。然而,關於如何將合成數據應用於其他用途,例如使大型語言模型(LLMs)適應特定領域,仍存在疑問。合成數據的一個主要限制是其多樣性不足,這對其用於改進其他模型的下游應用產生了負面影響。為解決這一問題,我們提出了MetaSynth,這是一種通過元提示(meta-prompting)來生成合成數據的方法,其中一個語言模型協調多個“專家”LLM代理協作生成數據。僅使用MetaSynth生成的2500萬個token的合成數據,我們成功將一個訓練良好的LLM(Mistral-7B-v0.3)適應到兩個專業領域——金融和生物醫學——且未損害模型在通用任務中的能力。此外,我們使用七種自動化指標評估了合成數據的多樣性,發現其接近LLM預訓練語料庫的多樣性。
持續使用MetaSynth對Mistral-7B-v0.3進行預訓練,顯著超越了基礎LLM,在金融領域提升了高達4.08%,在生物醫學領域提升了13.75%。當使用模板提示生成的數據進行訓練時,即使模板包含先前的生成結果和不同的真實數據上下文示例,同一模型的性能也會下降。我們的研究表明,在使用MetaSynth時,僅需數百萬個token的多樣性合成數據,無需混合任何真實數據,即可實現有效的領域適應。
English
Recent smaller language models such Phi-3.5 and Phi-4 rely on synthetic data
generated using larger Language models. Questions remain about leveraging
synthetic data for other use cases, such as adapting LLMs to specific domains.
A key limitation of synthetic data is low diversity, which negatively impacts
its downstream applicability for improving other models. To address this, we
propose MetaSynth, a method for generating synthetic data that enhances
diversity through meta-prompting, where a language model orchestrates multiple
"expert" LLM agents to collaboratively generate data. Using only 25 million
tokens of synthetic data generated with MetaSynth, we successfully adapt a
well-trained LLM (Mistral-7B-v0.3) to two specialized domains-Finance and
Biomedicine-without compromising the capabilities of the resulting model in
general tasks. In addition, we evaluate the diversity of our synthetic data
using seven automated metrics, and find that it approaches the diversity of LLM
pre-training corpora.
Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms
the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in
Biomedicine. The same model shows degraded performance when trained on data
generated using a template prompt, even when the template includes prior
generations and varying In-Context exemplars of real data. Our findings suggest
that a few million tokens of diverse synthetic data without mixing any real
data, is sufficient for effective domain adaptation when using MetaSynth.Summary
AI-Generated Summary