CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models
October 24, 2024
Authors: Liangdong Wang, Bo-Wen Zhang, Chengwei Wu, Hanyu Zhao, Xiaofeng Shi, Shuhao Gu, Jijie Li, Quanyue Ma, TengFei Pan, Guang Liu
cs.AI
Abstract
We present CCI3.0-HQ (https://huggingface.co/datasets/BAAI/CCI3-HQ), a
high-quality 500GB subset of the Chinese Corpora Internet 3.0
(CCI3.0) (https://huggingface.co/datasets/BAAI/CCI3-Data), developed using a
novel two-stage hybrid filtering pipeline that significantly enhances data
quality. To evaluate its effectiveness, we trained a 0.5B parameter model from
scratch on 100B tokens across various datasets, achieving superior performance
on 10 benchmarks in a zero-shot setting compared to CCI3.0, SkyPile, and
WanjuanV1. The high-quality filtering process effectively distills the
capabilities of the Qwen2-72B-instruct model into a compact 0.5B model,
attaining optimal F1 scores for Chinese web data classification. We believe
this open-access dataset will facilitate broader access to high-quality
language models.
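The abstract names a two-stage hybrid filtering pipeline but does not spell out the stages. As a minimal sketch only, the code below assumes a cheap rule-based pass followed by a score threshold from a learned quality classifier. The names `Document`, `passes_rules`, `two_stage_filter`, and `stub_score`, along with every threshold, are hypothetical and are not taken from the CCI3.0-HQ release.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Document:
    text: str


def passes_rules(doc: Document, min_chars: int = 20) -> bool:
    """Stage 1 (assumed): cheap heuristics that need no model."""
    text = doc.text.strip()
    if len(text) < min_chars:
        return False
    # Reject text dominated by a single repeated character.
    most_common = max(text.count(ch) for ch in set(text))
    return most_common / len(text) < 0.5


def two_stage_filter(
    docs: Iterable[Document],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Document]:
    """Stage 2 (assumed): keep rule survivors whose quality score clears the threshold."""
    survivors = [d for d in docs if passes_rules(d)]
    return [d for d in survivors if score_fn(d.text) >= threshold]


def stub_score(text: str) -> float:
    """Hypothetical stand-in for a learned quality classifier (not a real model)."""
    return 0.9 if "model" in text else 0.1


docs = [
    Document("too short"),
    Document("a" * 40),
    Document("Pre-training a language model needs clean, informative text."),
    Document("Click here to subscribe and win a free prize today!!!"),
]
for doc in two_stage_filter(docs, stub_score):
    print(doc.text)
```

In a real pipeline `score_fn` would wrap a trained classifier; the stub exists only so the control flow can be run end to end.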