GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

Abstract

The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, no available corpus (i) covers a wide range of minority languages; (ii) is generated by an open-source, reproducible pipeline; and (iii) is rigorously cleaned of noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general-domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it (the pipeline, the language identification model, and the filters) available to the research community. Corpus v1.0: https://huggingface.co/datasets/cis-lmu/GlotCC-v1; pipeline v3.0: https://github.com/cisnlp/GlotCC
