Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch
October 24, 2024
Authors: Yuyang Ding, Xinyu Shi, Xiaobo Liang, Juntao Li, Qiaoming Zhu, Min Zhang
cs.AI
Abstract
The availability of high-quality data is one of the most important factors in
improving the reasoning capability of LLMs. Existing works have demonstrated
the effectiveness of creating more instruction data from seed questions or
knowledge bases. Recent research indicates that continually scaling up data
synthesis from strong models (e.g., GPT-4) can further elicit reasoning
performance. Though promising, the open-source community still lacks
high-quality data at scale and scalable data synthesis methods with affordable
costs. To address this, we introduce ScaleQuest, a scalable and novel data
synthesis method that utilizes "small-size" (e.g., 7B) open-source models to
generate questions from scratch without the need for seed data with complex
augmentation constraints. With the efficient ScaleQuest, we automatically
constructed a mathematical reasoning dataset consisting of 1 million
problem-solution pairs, which is more effective than existing open-source
datasets. It universally improves the performance of mainstream open-source
models (i.e., Mistral, Llama3, DeepSeekMath, and Qwen2-Math), achieving gains of
29.2% to 46.4% on MATH. Notably, simply fine-tuning the Qwen2-Math-7B-Base
model with our dataset can even surpass Qwen2-Math-7B-Instruct, a strong and
well-aligned model trained on closed-source data, and proprietary models such as
GPT-4-Turbo and Claude-3.5 Sonnet.