UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
November 21, 2024
作者: Bethel Melesse Tessema, Akhil Kedia, Tae-Sun Chung
cs.AI
Abstract
Large language models (LLMs) underperform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts text from Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on low-resource languages while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improving LLMs for low-resource languages using consumer hardware. Our source code is available at https://github.com/bethelmelesse/unifiedcrawl.
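The "filter the index, then extract only matching records" idea can be sketched at small scale with off-the-shelf tools. The snippet below queries Common Crawl's public columnar URL index for pages annotated with a target language, then fetches one matching WARC record via an HTTP range request. The crawl snapshot, the language code ('amh', Amharic), and the DuckDB/warcio stack are illustrative assumptions, not necessarily the paper's exact pipeline.

```python
# Sketch: filter Common Crawl's columnar index by language, then fetch
# only the matching WARC records. Snapshot, language code, and LIMIT
# are illustrative assumptions.
import io

import duckdb
import requests
from warcio.archiveiterator import ArchiveIterator

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# commoncrawl is a public S3 bucket; depending on your DuckDB version,
# anonymous S3 access may need additional configuration.
con.execute("SET s3_region='us-east-1';")

# Find records whose detected content language is Amharic ('amh', ISO 639-3).
# Multi-language pages carry comma-separated codes, so strict equality keeps
# only pages detected as purely Amharic.
rows = con.execute("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2023-50/subset=warc/*.parquet')
    WHERE content_languages = 'amh'
    LIMIT 10
""").fetchall()

# Fetch just the bytes of one matching record with an HTTP range request,
# instead of downloading entire multi-GB WARC files.
url, warc_filename, offset, length = rows[0]
resp = requests.get(
    "https://data.commoncrawl.org/" + warc_filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)
for record in ArchiveIterator(io.BytesIO(resp.content)):
    if record.rec_type == "response":
        html = record.content_stream().read()
        # plain-text extraction (e.g. with a library such as trafilatura)
        # would follow here
```

Because only the index shards and the matching byte ranges are downloaded, this pattern keeps compute and bandwidth needs small even though the underlying corpus spans petabytes.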
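On the modeling side, QLoRA-style adaptation quantizes the frozen base model to 4 bits and trains small low-rank adapters on top. Below is a minimal sketch using the Hugging Face transformers/peft/bitsandbytes stack; the base model (facebook/xglm-564M, a small multilingual LLM) and the LoRA hyperparameters are assumptions for illustration, not necessarily the paper's exact configuration.

```python
# Minimal QLoRA sketch with transformers/peft/bitsandbytes.
# Model choice and LoRA hyperparameters are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/xglm-564M"  # example multilingual base model

# Load the frozen base model in 4-bit NF4 quantization to cut VRAM usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable low-rank adapters; only these weights are updated.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in XGLM
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of the model
```

Training then proceeds with a standard causal language modeling loop (e.g. transformers' Trainer) over the extracted corpus; because gradients flow only through the adapters while the 4-bit base model stays frozen, the run fits within consumer-GPU VRAM limits.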