OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
January 14, 2025
Authors: Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, but
their success heavily relies on the quality of pretraining corpora. For Chinese
LLMs, the scarcity of high-quality Chinese datasets presents a significant
challenge, often limiting their performance. To address this issue, we propose
the OpenCSG Chinese Corpus, a series of high-quality datasets specifically
designed for LLM pretraining, post-training, and fine-tuning. This corpus
includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and
Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets
focus on filtered, high-quality content derived from diverse Chinese web
sources; Cosmopedia-chinese provides synthetic, textbook-style data for
knowledge-intensive training; and Smoltalk-chinese emphasizes stylistically
diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its
high-quality text, diverse coverage across domains, and scalable, reproducible
data curation processes. Additionally, we conducted extensive experimental
analyses, including evaluations with smaller-parameter models, which demonstrated
significant performance improvements on benchmarks such as C-Eval, showcasing the
effectiveness of the corpus for training Chinese LLMs.
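The "scalable, reproducible data curation" the abstract highlights for Fineweb-edu-chinese centers on model-based quality filtering of Chinese web text. Below is a minimal sketch of that filtering step, assuming a hypothetical educational-quality classifier; the model ID, label name, and 0.8 threshold are placeholders for illustration, not the authors' released pipeline.

```python
# Sketch of classifier-based quality filtering in the style the abstract
# describes for Fineweb-edu-chinese: score each web document for educational
# value and keep only high-scoring ones. The model ID, label name, and
# threshold below are illustrative assumptions, not released artifacts.
from transformers import pipeline

scorer = pipeline(
    "text-classification",
    model="opencsg/edu-quality-scorer-zh",  # hypothetical classifier ID
)

def keep(text: str, threshold: float = 0.8) -> bool:
    """Return True if the document clears the educational-quality cutoff."""
    # Truncate long web pages so they fit the classifier's context window.
    result = scorer(text[:2048], truncation=True)[0]
    # Assumes the classifier's positive label marks educational content.
    return result["label"] == "educational" and result["score"] >= threshold

web_docs = [
    "本文介绍线性代数中矩阵乘法的基本性质与证明。",  # likely kept
    "点击这里领取优惠券,限时特价!!!",              # likely dropped
]
filtered = [doc for doc in web_docs if keep(doc)]
print(f"kept {len(filtered)} of {len(web_docs)} documents")
```

Raising the threshold trades corpus size for purity; a filter like this is cheap to rerun over new crawls, which is what makes the curation process reproducible and scalable.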