Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages

Abstract

Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data. While current state-of-the-art large language models (LLMs) still struggle with LRLs, smaller multilingual models (mLMs) such as mBERT and XLM-R offer greater promise because their capacity is better matched to small training datasets. This study systematically investigates parameter-efficient adapter-based methods for adapting mLMs to LRLs, evaluating three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and structured knowledge from ConceptNet, we show that small adaptation datasets (e.g., up to 1 GB of free text or a few MB of knowledge graph data) yield gains in intrinsic (masked language modeling) and extrinsic tasks (topic classification, sentiment analysis, and named entity recognition). We find that Sequential Bottleneck adapters excel in language modeling, while Invertible Bottleneck adapters slightly outperform other methods on downstream tasks due to better embedding alignment and larger parameter counts. Adapter-based methods match or outperform full fine-tuning while using far fewer parameters, and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves performance, pre-training data size remains the dominant factor, especially for languages with extensive pre-training coverage.
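
As a rough illustration of two of the adapter styles named in the abstract, the following NumPy sketch shows a sequential bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection) and a low-rank (LoRA-style) update to a frozen weight matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: the dimensions (`d`, `m`, `r`), the scaling factor `alpha`, the zero initialization, and the omission of bias terms are all illustrative choices, and the Invertible Bottleneck variant is not sketched here.

```python
import numpy as np

def bottleneck_adapter(h, W_down, W_up):
    """Sequential bottleneck: down-project, ReLU, up-project, then add the residual."""
    z = np.maximum(h @ W_down, 0.0)
    return h + z @ W_up

def lora_linear(x, W, A, B, alpha=16.0):
    """Frozen weight W plus a trainable low-rank update scaled by alpha / r."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B

rng = np.random.default_rng(0)
d, m, r = 768, 48, 8           # hidden size, bottleneck size, LoRA rank (assumed values)
h = rng.normal(size=(4, d))    # a toy batch of 4 hidden states

# Bottleneck adapter: zero-initialized up-projection, so it starts as the identity.
W_down = rng.normal(scale=0.02, size=(d, m))
W_up = np.zeros((m, d))
out_adapter = bottleneck_adapter(h, W_down, W_up)

# LoRA: zero-initialized B, so the layer starts identical to the frozen one.
W = rng.normal(scale=0.02, size=(d, d))   # stands in for a frozen pretrained weight
A = rng.normal(scale=0.02, size=(d, r))
B = np.zeros((r, d))
out_lora = lora_linear(h, W, A, B)

# Trainable parameters per layer versus updating the full d x d matrix.
adapter_params = W_down.size + W_up.size
lora_params = A.size + B.size
full_params = W.size
print(adapter_params, lora_params, full_params)
```

With these assumed sizes, both adapter variants train only a small fraction of the parameters that updating the full weight matrix would require, which is the sense in which such methods are called parameter-efficient.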
