

MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

April 17, 2025
作者: Haris Riaz, Sourav Bhabesh, Vinayak Arannil, Miguel Ballesteros, Graham Horwood
cs.AI

Abstract

Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple "expert" LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains, Finance and Biomedicine, without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing any real data, are sufficient for effective domain adaptation when using MetaSynth.

