InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
September 19, 2024
Authors: Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You
cs.AI
Abstract
Pre-training on large-scale, high-quality datasets is crucial for enhancing
the reasoning capabilities of Large Language Models (LLMs), especially in
specialized domains such as mathematics. Despite the recognized importance, the
Multimodal LLMs (MLLMs) field currently lacks a comprehensive open-source
pre-training dataset specifically designed for mathematical reasoning. To
address this gap, we introduce InfiMM-WebMath-40B, a high-quality dataset of
interleaved image-text documents. It comprises 24 million web pages, 85 million
associated image URLs, and 40 billion text tokens, all meticulously extracted
and filtered from CommonCrawl. We provide a detailed overview of our data
collection and processing pipeline. To demonstrate the robustness of
InfiMM-WebMath-40B, we conducted evaluations in both text-only and multimodal
settings. Our evaluations on text-only benchmarks show that, despite utilizing
only 40 billion tokens, our dataset significantly enhances the performance of
our 1.3B model, delivering results comparable to DeepSeekMath-1.3B, which uses
120 billion tokens for the same model size. Furthermore, with the introduction
of our multi-modal math pre-training dataset, our models set a new
state-of-the-art among open-source models on multi-modal math benchmarks such
as MathVerse and We-Math. We release our data at
https://huggingface.co/datasets/Infi-MM/InfiMM-WebMath-40B.
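To make the "interleaved image-text document" format concrete, below is a minimal sketch of what a single record and its interleaving might look like. The field names (`url`, `image_urls`, `texts`) and the per-position `None` convention are assumptions for illustration only, not the released schema.

```python
# Hypothetical sketch of one interleaved image-text document.
# Per position, exactly one modality is present; None marks the
# slot belonging to the other modality.

def interleave(image_urls, texts):
    """Merge aligned image-URL and text lists into one interleaved
    sequence of (modality, content) pairs, skipping None entries."""
    out = []
    for img, txt in zip(image_urls, texts):
        if img is not None:
            out.append(("image", img))
        if txt is not None:
            out.append(("text", txt))
    return out

doc = {
    "url": "https://example.com/algebra-lesson",          # source web page
    "image_urls": [None, "https://example.com/fig1.png"],  # hypothetical
    "texts": ["Solve x^2 - 4 = 0.", None],                 # hypothetical
}

sequence = interleave(doc["image_urls"], doc["texts"])
# sequence -> [("text", "Solve x^2 - 4 = 0."),
#              ("image", "https://example.com/fig1.png")]
```

In practice the released data can presumably be pulled with the Hugging Face `datasets` library, e.g. `load_dataset("Infi-MM/InfiMM-WebMath-40B", streaming=True)`; note that the dataset stores image URLs rather than image bytes, so images must be fetched separately.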