GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
October 31, 2024
Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze
cs.AI
Abstract
The need for large text corpora has increased with the advent of pretrained
language models and, in particular, the discovery of scaling laws for these
models. Most available corpora have sufficient data only for languages with
large dominant communities. However, there is no corpus available that (i)
covers a wide range of minority languages; (ii) is generated by an open-source
reproducible pipeline; and (iii) is rigorously cleaned from noise, making it
trustworthy to use. We present GlotCC, a clean, document-level, 2TB general
domain corpus derived from CommonCrawl, covering more than 1000 languages. We
make GlotCC and the system used to generate it (including the pipeline,
language identification model, and filters) available to the research
community. Corpus v. 1.0: https://huggingface.co/datasets/cis-lmu/GlotCC-v1
Pipeline v. 3.0: https://github.com/cisnlp/GlotCC