Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages
February 15, 2025
Authors: Zeli Su, Ziyin Zhang, Guixian Xu, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
cs.AI
Abstract
While multilingual language models like XLM-R have advanced multilingualism
in NLP, they still perform poorly in extremely low-resource languages. This
situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen
support far fewer languages than XLM-R, making text generation models
non-existent for many languages in the world. To tackle this challenge, we
propose a novel framework for adapting multilingual encoders to text generation
in extremely low-resource languages. By reusing the weights between the encoder
and the decoder, our framework allows the model to leverage the learned
semantic space of the encoder, enabling efficient learning and effective
generalization in low-resource languages. Applying this framework to four
Chinese minority languages, we present XLM-SWCM, and demonstrate its superior
performance on various downstream tasks even when compared with much larger
models.
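As a rough illustration of the weight-reuse idea described in the abstract, the sketch below initializes a decoder stack by cloning layers from a pretrained XLM-R encoder, so that generation starts from the encoder's learned semantic space. This is a minimal sketch under stated assumptions, not the authors' released implementation of XLM-SWCM: the function name build_decoder_from_encoder, the choice of xlm-roberta-base, and the layer count are illustrative.

```python
# Minimal sketch: reuse a pretrained multilingual encoder's weights to
# initialize decoder layers (illustrative, not the authors' code).
import copy

import torch.nn as nn
from transformers import XLMRobertaModel


def build_decoder_from_encoder(encoder: XLMRobertaModel,
                               num_decoder_layers: int = 4) -> nn.ModuleList:
    """Clone encoder transformer blocks to serve as the decoder's initialization."""
    encoder_layers = encoder.encoder.layer  # ModuleList of transformer blocks
    return nn.ModuleList(
        copy.deepcopy(encoder_layers[i % len(encoder_layers)])
        for i in range(num_decoder_layers)
    )


# Usage: load a multilingual encoder and derive decoder layers from it.
encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
decoder_layers = build_decoder_from_encoder(encoder, num_decoder_layers=4)
```

A complete seq2seq model in the spirit of the paper would still need cross-attention and a language-modeling head on top of these layers; the sketch isolates only the weight-reuse step that lets the decoder inherit the encoder's multilingual semantic space.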