Training and Evaluating Language Models with Template-based Data Generation

November 27, 2024
Author: Yifan Zhang
cs.AI

Abstract

The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, these models often struggle with tasks requiring complex reasoning, particularly in mathematical problem-solving, due in part to the scarcity of large-scale, high-quality, domain-specific datasets necessary for training sophisticated reasoning abilities. To address this limitation, we introduce Template-based Data Generation (TDG), a novel approach that leverages LLMs (GPT-4) to automatically generate parameterized meta-templates, which are then used to synthesize a vast array of high-quality problems and solutions. Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset comprising over 7 million synthetically generated grade school math problems--each accompanied by code-based and natural language solutions--with the potential to generate an effectively unlimited number more. This dataset alleviates the scarcity of large-scale mathematical datasets and serves as a valuable resource for pre-training, fine-tuning, and evaluating LLMs in mathematical reasoning. Our method not only enables the generation of virtually infinite data but also elevates data augmentation to a new level by using GPT-4 for meta-template generation, ensuring diverse and high-quality problem structures. The TemplateMath Part I: TemplateGSM dataset is publicly available at https://huggingface.co/datasets/math-ai/TemplateGSM. The code is available at https://github.com/iiis-ai/TemplateMath.
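
To make the idea of a parameterized template concrete, the following is a minimal sketch, not the authors' implementation and not a template from TemplateGSM. It assumes a single hand-written template; the name pool, parameter ranges, dictionary keys, and the `generate_problem` and `verify` helpers are all hypothetical. Each call samples parameters, fills in the problem text, and emits both a code-based and a natural-language solution, mirroring the paired solutions described in the abstract.

```python
import random

# Hypothetical name pool; not taken from the TemplateGSM templates.
NAMES = ["Ava", "Liam", "Noah", "Mia"]


def generate_problem(rng: random.Random) -> dict:
    """Instantiate one illustrative template with randomly sampled parameters."""
    name = rng.choice(NAMES)
    boxes = rng.randint(2, 20)        # assumed parameter range
    per_box = rng.randint(3, 12)      # assumed parameter range
    total = boxes * per_box
    # Keep the amount given away strictly below the total so the answer is positive.
    given_away = rng.randint(1, total - 1)
    answer = total - given_away

    problem = (
        f"{name} buys {boxes} boxes of pencils with {per_box} pencils in each box. "
        f"{name} gives away {given_away} pencils. "
        f"How many pencils does {name} have left?"
    )
    # Code-based solution, stored as source text alongside the problem.
    solution_code = (
        "def solution():\n"
        f"    total = {boxes} * {per_box}\n"
        f"    remaining = total - {given_away}\n"
        "    return remaining\n"
    )
    # Natural-language solution built from the same sampled parameters.
    solution_text = (
        f"{name} starts with {boxes} * {per_box} = {total} pencils. "
        f"After giving away {given_away}, "
        f"{total} - {given_away} = {answer} pencils remain."
    )
    return {
        "problem": problem,
        "solution_code": solution_code,
        "solution_text": solution_text,
        "answer": answer,
    }


def verify(item: dict) -> bool:
    """Run the generated code solution and compare it with the stored answer."""
    namespace = {}
    exec(item["solution_code"], namespace)
    return namespace["solution"]() == item["answer"]


if __name__ == "__main__":
    rng = random.Random(0)
    item = generate_problem(rng)
    print(item["problem"])
    print(item["solution_text"])
    print("verified:", verify(item))
```

Because every instance carries an executable solution, each generated problem can be checked by running its own code, which is one plausible way to keep synthetic data consistent at scale. A single template with these ranges already yields thousands of distinct instances; a larger pool of templates multiplies that further.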
