GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

Abstract

The need for large text corpora has grown with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, no available corpus (i) covers a wide range of minority languages; (ii) is generated by an open-source, reproducible pipeline; and (iii) is rigorously cleaned of noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general-domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it (including the pipeline, language identification model, and filters) available to the research community. Corpus version 1.0 is available at https://huggingface.co/datasets/cis-lmu/GlotCC-v1, and pipeline version 3.0 at https://github.com/cisnlp/GlotCC.

