OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

January 14, 2025
Authors: Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. The corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: the Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistically diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations of smaller-parameter models, which demonstrated significant performance improvements on tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.
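The abstract describes a filtered web corpus distributed as open datasets. As a practical illustration only, the Python sketch below streams one release with the Hugging Face `datasets` library and applies a FineWeb-Edu-style quality cutoff. The repository ID, the `text` and `score` field names, and the threshold of 3 (the value used by the original FineWeb-Edu recipe) are assumptions for illustration, not details taken from this abstract.

```python
# Minimal sketch: stream a corpus release and keep documents that pass
# a FineWeb-Edu-style educational-quality filter. The repo ID, field
# names, and threshold below are assumptions, not from the paper.
from datasets import load_dataset

# Hypothetical Hub path for the Fineweb-edu-chinese release.
ds = load_dataset("opencsg/chinese-fineweb-edu", split="train", streaming=True)

# Keep documents whose (assumed) quality score is >= 3, mirroring the
# threshold used by the original FineWeb-Edu pipeline.
high_quality = (ex for ex in ds if ex.get("score", 0) >= 3)

# Peek at the first few retained documents.
for i, example in enumerate(high_quality):
    print(example["text"][:200].replace("\n", " "))
    if i >= 2:
        break
```

Streaming mode avoids downloading the full corpus up front, which is the usual choice when inspecting or sub-sampling a pretraining-scale dataset.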
