

MegaMath: Pushing the Limits of Open Math Corpora

April 3, 2025
作者: Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, Eric P. Xing
cs.AI

Abstract

Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fastText-based filtering, and deduplication, all to acquire higher-quality data from the Internet. (2) Recalling math-related code data: We identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.
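The deduplication step mentioned in practice (1) can be illustrated with a minimal MinHash sketch, a standard near-duplicate detection technique for web corpora. This is only a hedged illustration under assumed parameters (3-word shingles, 64 hash functions); the abstract does not specify which deduplication method or settings MegaMath actually uses.

```python
import hashlib


def shingles(text, n=3):
    """Break a document into a set of n-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical example documents: a near-duplicate pair and an unrelated one.
doc_a = "the quadratic formula gives the roots of ax^2 + bx + c = 0"
doc_b = "the quadratic formula gives the roots of a x squared plus bx plus c"
doc_c = "stochastic gradient descent updates parameters along the negative gradient"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
# Near-duplicates score higher than unrelated documents; a pipeline would
# drop one member of each pair whose estimate exceeds a chosen threshold.
```

In a full pipeline, signatures would be bucketed with locality-sensitive hashing so that only candidate pairs, not all pairs, are compared.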
