Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages

February 14, 2025
Authors: Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann
cs.AI

Abstract

Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data. While current state-of-the-art large language models (LLMs) still struggle with LRLs, smaller multilingual models (mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of their capacity to low training data sizes. This study systematically investigates parameter-efficient adapter-based methods for adapting mLMs to LRLs, evaluating three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and structured knowledge from ConceptNet, we show that small adaptation datasets (e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains in intrinsic (masked language modeling) and extrinsic tasks (topic classification, sentiment analysis, and named entity recognition). We find that Sequential Bottleneck adapters excel in language modeling, while Invertible Bottleneck adapters slightly outperform other methods on downstream tasks due to better embedding alignment and larger parameter counts. Adapter-based methods match or outperform full fine-tuning while using far fewer parameters, and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves performance, pre-training data size remains the dominant factor, especially for languages with extensive pre-training coverage.
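
As a concrete illustration of the adapter-based adaptation described in the abstract, the sketch below sets up each of the three evaluated architectures on XLM-R using the AdapterHub `adapters` library. The library choice, model identifier, adapter name, and hyperparameters are assumptions for illustration and may differ from the authors' actual implementation.

```python
# Minimal sketch of parameter-efficient language adaptation with the
# AdapterHub `adapters` library (assumed setup; the paper's exact training
# code and hyperparameters are not reproduced here). The base model is
# frozen and only the lightweight adapter modules are trained.
from transformers import AutoModelForMaskedLM, AutoTokenizer
import adapters
from adapters import SeqBnConfig, SeqBnInvConfig, LoRAConfig

model_name = "xlm-roberta-base"  # or "bert-base-multilingual-cased" for mBERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
adapters.init(model)  # enable adapter support on a plain Hugging Face model

# One of the three architectures compared in the paper (illustrative settings):
config = SeqBnConfig(reduction_factor=16)   # Sequential Bottleneck
# config = SeqBnInvConfig()                 # Invertible Bottleneck
# config = LoRAConfig(r=8, alpha=16)        # Low-Rank Adaptation (LoRA)

model.add_adapter("lrl_mlm_adapter", config=config)
model.train_adapter("lrl_mlm_adapter")  # freeze base weights; train only the adapter

# The adapter would then be trained with a standard masked-language-modeling
# objective on target-language text (e.g., GlotCC data), using the usual
# Hugging Face Trainer / DataCollatorForLanguageModeling pipeline.
```

For downstream evaluation (topic classification, sentiment analysis, named entity recognition), a task head can be trained on top of the frozen base model plus the language adapter, which keeps the number of updated parameters far below that of full fine-tuning.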
