Training and Evaluating Language Models with Template-based Data Generation

Abstract

The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, with these models showing remarkable capabilities in understanding and generating language. However, they often struggle with tasks that require complex reasoning, particularly mathematical problem-solving, due in part to the scarcity of the large-scale, high-quality, domain-specific datasets needed to train sophisticated reasoning abilities. To address this limitation, we introduce Template-based Data Generation (TDG), an approach that uses an LLM (GPT-4) to automatically generate parameterized meta-templates, which are then used to synthesize a large and varied set of high-quality problems and solutions. Using TDG, we create TemplateMath Part I: TemplateGSM, a dataset of over 7 million synthetically generated grade school math problems, each accompanied by a code-based solution and a natural language solution. The same templates can generate an effectively unlimited number of additional problems. This dataset alleviates the scarcity of large-scale mathematical datasets and serves as a resource for pre-training, fine-tuning, and evaluating LLMs in mathematical reasoning. Our method not only enables the generation of virtually unlimited data but also advances data augmentation by using GPT-4 to generate the meta-templates themselves, which helps ensure diverse, high-quality problem structures. The TemplateMath Part I: TemplateGSM dataset is publicly available at https://huggingface.co/datasets/math-ai/TemplateGSM. The code is available at https://github.com/iiis-ai/TemplateMath.
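To make the idea of a parameterized template concrete, the following is a minimal sketch in Python. It is not taken from the TemplateMath code or the TemplateGSM dataset: the function name generate_problem, the name and item pools, the parameter ranges, and the output field names are all hypothetical, chosen only for illustration. It shows one hand-written template that samples its parameters, renders a problem, and emits both a code-based and a natural language solution. In TDG as described above, GPT-4 would generate such templates rather than a human writing them.

```python
import random

# Hypothetical parameter pools; not taken from TemplateGSM.
NAMES = ["Ava", "Ben", "Chen", "Dana"]
ITEMS = ["apples", "pencils", "stickers"]


def generate_problem(rng: random.Random) -> dict:
    """Instantiate one illustrative template with freshly sampled parameters."""
    name = rng.choice(NAMES)
    item = rng.choice(ITEMS)
    start = rng.randint(10, 50)
    bought = rng.randint(1, 20)
    given = rng.randint(1, start)  # at most `start`, so the answer stays positive

    subtotal = start + bought
    answer = subtotal - given

    problem = (
        f"{name} has {start} {item}. {name} buys {bought} more and then "
        f"gives away {given}. How many {item} does {name} have now?"
    )
    # Code-based solution: a small program whose `result` is the answer.
    solution_code = (
        f"start = {start}\n"
        f"bought = {bought}\n"
        f"given = {given}\n"
        f"result = start + bought - given\n"
    )
    # Natural language solution built from the same sampled parameters.
    solution_text = (
        f"{name} starts with {start} {item} and buys {bought} more, for "
        f"{start} + {bought} = {subtotal}. After giving away {given}, "
        f"{name} has {subtotal} - {given} = {answer} {item}."
    )
    return {
        "problem": problem,
        "solution_code": solution_code,
        "solution_text": solution_text,
        "answer": answer,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    sample = generate_problem(rng)
    # Check the code-based solution by executing it and comparing results.
    scope = {}
    exec(sample["solution_code"], scope)
    assert scope["result"] == sample["answer"]
    print(sample["problem"])
    print(sample["solution_text"])
```

Because every call draws new parameters, a single template of this kind yields many distinct problems, and running the code-based solution gives a simple check that the stated answer is consistent.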
