MegaMath: Pushing the Limits of Open Math Corpora
April 3, 2025
Authors: Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, Eric P. Xing
cs.AI
Abstract
Mathematical reasoning is a cornerstone of human intelligence and a key
benchmark for advanced capabilities in large language models (LLMs). However,
the research community still lacks an open, large-scale, high-quality corpus
tailored to the demands of math-centric LLM pre-training. We present MegaMath,
an open dataset curated from diverse, math-focused sources through the following
practices: (1) Revisiting web data: We re-extracted mathematical documents from
Common Crawl with math-oriented HTML optimizations, fasttext-based filtering
and deduplication, all for acquiring higher-quality data on the Internet. (2)
Recalling math-related code data: We identified high-quality math-related code
from the large code training corpus Stack-V2, further enhancing data diversity.
(3) Exploring Synthetic data: We synthesized QA-style text, math-related code,
and interleaved text-code blocks from web data or code data. By integrating
these strategies and validating their effectiveness through extensive
ablations, MegaMath delivers 371B tokens with the largest quantity and top
quality among existing open math pre-training datasets.
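The web-data step above combines classifier-based filtering with deduplication. A minimal sketch of that idea follows; the keyword heuristic here is a hypothetical stand-in for MegaMath's fasttext classifier, and the exact-match hashing stands in for the dataset's full deduplication pipeline.

```python
import hashlib
import re

# Hypothetical stand-in for a fasttext math classifier: score a document
# by the density of math-indicative tokens (LaTeX macros, operators, terms).
MATH_HINTS = re.compile(
    r"\\frac|\\sum|\\int|[=+\-*/^]|\d+\s*[+\-*/]\s*\d+|theorem|lemma|equation"
)

def math_score(doc: str) -> float:
    """Fraction of math-indicative matches relative to word count."""
    hits = len(MATH_HINTS.findall(doc))
    return hits / max(len(doc.split()), 1)

def filter_and_dedup(docs, threshold=0.05):
    """Keep math-like documents and drop exact duplicates.

    Documents are normalized (lowercased, whitespace-collapsed) and
    hashed so that trivially repeated pages are kept only once.
    """
    seen, kept = set(), []
    for doc in docs:
        if math_score(doc) < threshold:
            continue  # not math-like enough; filtered out
        key = hashlib.sha1(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

In the real pipeline a trained fastText model replaces the keyword score, and near-duplicate detection (e.g. MinHash) replaces the exact-hash check, but the overall filter-then-dedup flow is the same.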