UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

November 21, 2024
Authors: Bethel Melesse Tessema, Akhil Kedia, Tae-Sun Chung
cs.AI

Abstract

Large language models (LLMs) underperform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on low-resource languages while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improving LLMs for low-resource languages using consumer hardware. Our source code is available at https://github.com/bethelmelesse/unifiedcrawl.
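
The collection pipeline described above — filter the Common Crawl index down to one target language, then download only the matching records — can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's released code: the DuckDB query over Common Crawl's columnar URL index, the snapshot name, the Amharic language code, and the trafilatura extraction step are all assumptions made for the example.

```python
# Sketch: pull one language out of Common Crawl without downloading full dumps.
# Assumes the commoncrawl S3 bucket is readable anonymously from this DuckDB
# setup; the fields used below (content_languages, warc_filename, offsets)
# come from Common Crawl's columnar URL index.
import io

import duckdb       # pip install duckdb
import requests     # pip install requests
import trafilatura  # pip install trafilatura
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

LANG = "amh"               # hypothetical target language (Amharic, ISO 639-3)
CRAWL = "CC-MAIN-2023-50"  # one Common Crawl snapshot

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs; SET s3_region='us-east-1';")

# One row per captured page; we only need the WARC pointers for matching pages.
rows = con.execute(f"""
    SELECT warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet(
        's3://commoncrawl/cc-index/table/cc-main/warc/crawl={CRAWL}/subset=warc/*.parquet')
    WHERE content_languages = '{LANG}'
    LIMIT 100
""").fetchall()

texts = []
for filename, offset, length in rows:
    # HTTP Range request fetches just this record, not the ~1 GB WARC file.
    resp = requests.get(
        f"https://data.commoncrawl.org/{filename}",
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        timeout=60,
    )
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # strip HTML boilerplate
            if text:
                texts.append(text)

print(f"Collected {len(texts)} documents in {LANG}")
```

Restricting the query to index rows whose detected language matches the target is what keeps compute and bandwidth minimal: only a tiny fraction of each multi-terabyte snapshot is ever downloaded.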
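
The adaptation step uses QLoRA: the base model is loaded in 4-bit precision and only small low-rank adapter weights are trained, which is what keeps VRAM usage within consumer-GPU limits. Below is a hedged sketch using the Hugging Face transformers/peft/bitsandbytes stack; the base model (bloom-560m), the adapter target modules, and every hyperparameter are placeholder assumptions rather than the paper's settings.

```python
# Sketch: QLoRA fine-tuning on the collected monolingual corpus.
import torch
from datasets import Dataset  # pip install datasets
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL = "bigscience/bloom-560m"  # placeholder multilingual base model

bnb = BitsAndBytesConfig(        # 4-bit NF4 quantization (the "Q" in QLoRA)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained(MODEL)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(               # trainable low-rank adapters (the "LoRA")
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projections in BLOOM
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

ds = Dataset.from_dict({"text": texts})  # `texts` from the crawl step above
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qlora-out", per_device_train_batch_size=1,
        gradient_accumulation_steps=16, num_train_epochs=1,
        learning_rate=2e-4, bf16=True,  # use fp16=True on pre-Ampere GPUs
        logging_steps=10, report_to="none",
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

Because only the adapter weights receive gradients, the optimizer state is a few megabytes rather than a full copy of the model, which is why this fits on a single consumer GPU.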
