
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

November 21, 2024
Authors: Bethel Melesse Tessema, Akhil Kedia, Tae-Sun Chung
cs.AI

Abstract

Large language models (LLMs) underperform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on low-resource languages while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improve LLMs for low-resource languages using consumer hardware. Our source code is available at https://github.com/bethelmelesse/unifiedcrawl.
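To make the data-collection idea concrete, below is a minimal sketch (not the authors' released pipeline; see the GitHub repository for that) of how one can filter Common Crawl for a single language using only the CDX index plus HTTP Range requests, so that full WARC archives are never downloaded. The target language code and the index shard URL are illustrative placeholders.

```python
# Minimal sketch: language-filtering Common Crawl via its CDX index.
# NOT the UnifiedCrawl code -- an illustrative approximation only.
# TARGET_LANG and INDEX_SHARD are placeholder values.
import gzip
import json

import requests

TARGET_LANG = "amh"  # ISO-639-3 code; Amharic used here as an example
INDEX_SHARD = ("https://data.commoncrawl.org/cc-index/collections/"
               "CC-MAIN-2023-50/indexes/cdx-00000.gz")

def iter_language_records(shard_url, lang):
    """Stream one gzipped CDX shard, yielding only records tagged `lang`."""
    resp = requests.get(shard_url, stream=True, timeout=60)
    resp.raise_for_status()
    # CDX shards are multi-member gzip; Python's gzip module handles that.
    with gzip.open(resp.raw, mode="rt", encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            # Each line is "<SURT key> <timestamp> <JSON payload>".
            try:
                record = json.loads(line.split(" ", 2)[2])
            except (IndexError, json.JSONDecodeError):
                continue
            # "languages" holds comma-separated detected language codes.
            if lang in record.get("languages", "").split(","):
                yield record

def fetch_warc_record(record):
    """Fetch just the matching WARC record via an HTTP Range request."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    url = "https://data.commoncrawl.org/" + record["filename"]
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"},
                        timeout=60)
    resp.raise_for_status()
    # Each WARC record is its own gzip member, so it decompresses standalone.
    return gzip.decompress(resp.content)

if __name__ == "__main__":
    for i, rec in enumerate(iter_language_records(INDEX_SHARD, TARGET_LANG)):
        print(rec["url"])
        if i >= 4:  # stop after a few matches for the demo
            break
```

Plain text would still need to be extracted and deduplicated from the fetched records before training. For the adaptation step, the sketch below shows a generic QLoRA setup with Hugging Face transformers, peft, and bitsandbytes: the base model is loaded in 4-bit and frozen, and only small low-rank adapter matrices are trained, which is what keeps VRAM usage low. The model name, target modules, and LoRA hyperparameters are placeholders, not the paper's exact settings.

```python
# Minimal QLoRA sketch with transformers + peft + bitsandbytes.
# MODEL_NAME, target_modules, and hyperparameters are illustrative.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "facebook/xglm-564M"  # placeholder multilingual causal LM

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters are the only trainable parameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # module names vary by architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters only; 4-bit base stays frozen
```

Training then proceeds as ordinary causal language modeling (e.g., with the transformers Trainer) on the collected monolingual corpus, fitting on a single consumer GPU because only the adapter weights require optimizer state.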

