Training and Evaluating Language Models with Template-based Data Generation
November 27, 2024
Author: Yifan Zhang
cs.AI
Abstract
The rapid advancement of large language models (LLMs) such as GPT-3, PaLM,
and Llama has significantly transformed natural language processing, showcasing
remarkable capabilities in understanding and generating language. However,
these models often struggle with tasks requiring complex reasoning,
particularly in mathematical problem-solving, due in part to the scarcity of
large-scale, high-quality, domain-specific datasets necessary for training
sophisticated reasoning abilities. To address this limitation, we introduce
Template-based Data Generation (TDG), a novel approach that leverages LLMs
(GPT-4) to automatically generate parameterized meta-templates, which are then
used to synthesize a vast array of high-quality problems and solutions.
Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset
comprising over 7 million synthetically generated grade school math
problems--each accompanied by code-based and natural language solutions--with
the potential to generate an effectively unlimited number more. This dataset
alleviates the scarcity of large-scale mathematical datasets and serves as a
valuable resource for pre-training, fine-tuning, and evaluating LLMs in
mathematical reasoning. Our method not only enables the generation of virtually
infinite data but also elevates data augmentation to a new level by using GPT-4
for meta-template generation, ensuring diverse and high-quality problem
structures. The TemplateMath Part I: TemplateGSM dataset is publicly available
at https://huggingface.co/datasets/math-ai/TemplateGSM. The code is available
at https://github.com/iiis-ai/TemplateMath.
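
To make the idea of a parameterized meta-template concrete, here is a minimal sketch in Python. It is not the authors' code: the template `sales_template`, the parameter pools `NAMES` and `ITEMS`, and the helpers `run_solution` and `generate` are all hypothetical names chosen for illustration. In TDG the meta-templates are produced by GPT-4, whereas here a single hand-written template stands in for one. The consistency check that executes the code-based solution is also an assumption of this sketch.

```python
import random

# Hypothetical parameter pools; a real meta-template would draw from larger ones.
NAMES = ["Ava", "Ben", "Chen", "Dara"]
ITEMS = ["apples", "pencils", "stickers"]


def sales_template(rng):
    """Instantiate one hand-written (hypothetical) meta-template."""
    name = rng.choice(NAMES)
    item = rng.choice(ITEMS)
    first = rng.randint(5, 50)    # quantity sold on Monday
    factor = rng.randint(2, 5)    # Tuesday is `factor` times Monday
    second = first * factor
    total = first + second

    problem = (
        f"{name} sold {first} {item} on Monday and {factor} times as many "
        f"on Tuesday. How many {item} did {name} sell in total?"
    )
    # Code-based solution, stored as source text alongside the problem.
    solution_code = (
        "def solve():\n"
        f"    monday = {first}\n"
        f"    tuesday = monday * {factor}\n"
        "    return monday + tuesday\n"
    )
    # Natural language solution built from the same sampled parameters.
    solution_text = (
        f"{name} sold {first} {item} on Monday and {first} * {factor} = "
        f"{second} on Tuesday, so the total is {first} + {second} = {total}."
    )
    return {
        "problem": problem,
        "solution_code": solution_code,
        "solution_text": solution_text,
        "answer": total,
    }


def run_solution(code):
    """Execute a generated code solution and return its result."""
    namespace = {}
    exec(code, namespace)  # acceptable here: the code comes from our own template
    return namespace["solve"]()


def generate(n, seed=0):
    """Sample n problems, keeping those whose code solution matches the answer."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        sample = sales_template(rng)
        if run_solution(sample["solution_code"]) == sample["answer"]:
            samples.append(sample)
    return samples


if __name__ == "__main__":
    for sample in generate(2, seed=42):
        print(sample["problem"])
        print(sample["solution_text"])
        print()
```

Because every instance comes from sampled parameters, a single template yields many distinct problems, and adding more templates widens the range of problem structures. Fixing the seed makes a generated set reproducible.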