OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
January 14, 2025
Authors: Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, but
their success heavily relies on the quality of pretraining corpora. For Chinese
LLMs, the scarcity of high-quality Chinese datasets presents a significant
challenge, often limiting their performance. To address this issue, we propose
the OpenCSG Chinese Corpus, a series of high-quality datasets specifically
designed for LLM pretraining, post-training, and fine-tuning. This corpus
includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and
Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets
focus on filtered, high-quality content derived from diverse Chinese web
sources; Cosmopedia-chinese provides synthetic, textbook-style data for
knowledge-intensive training; and Smoltalk-chinese emphasizes stylistically
diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its
high-quality text, diverse coverage across domains, and scalable, reproducible
data curation processes. Additionally, we conducted extensive experimental
analyses, including evaluations on smaller-parameter models, which demonstrated
significant performance improvements in tasks such as C-Eval, showcasing the
effectiveness of the corpus for training Chinese LLMs.