CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models

Abstract

We present CCI3.0-HQ (https://huggingface.co/datasets/BAAI/CCI3-HQ), a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0) (https://huggingface.co/datasets/BAAI/CCI3-Data), developed using a novel two-stage hybrid filtering pipeline that significantly enhances data quality. To evaluate its effectiveness, we trained a 0.5B-parameter model from scratch on 100B tokens across various datasets, achieving superior zero-shot performance on 10 benchmarks compared with CCI3.0, SkyPile, and WanjuanV1. The high-quality filtering process effectively distills the capabilities of the Qwen2-72B-instruct model into a compact 0.5B model, attaining optimal F1 scores for Chinese web data classification. We believe this open-access dataset will facilitate broader access to high-quality language models.
