Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources
April 5, 2025
Authors: Zihao Li, Shaoxiong Ji, Hengyu Luo, Jörg Tiedemann
cs.AI
Abstract
Large Language Models (LLMs) exhibit significant disparities in performance
across languages, primarily benefiting high-resource languages while
marginalizing underrepresented ones. Continual Pretraining (CPT) has emerged as
a promising approach to address this imbalance, although the relative
effectiveness of monolingual, bilingual, and code-augmented data strategies
remains unclear. This study systematically evaluates 36 CPT configurations
involving three multilingual base models, across 30+ languages categorized as
altruistic, selfish, and stagnant, spanning various resource levels. Our
findings reveal three major insights: (1) Bilingual CPT improves multilingual
classification but often causes language mixing issues during generation. (2)
Including programming code data during CPT consistently enhances multilingual
classification accuracy, particularly benefiting low-resource languages, but
introduces a trade-off by slightly degrading generation quality. (3) Contrary
to prior work, we observe substantial deviations from language classifications
according to their impact on cross-lingual transfer: Languages classified as
altruistic often negatively affect related languages, selfish languages show
conditional and configuration-dependent behavior, and stagnant languages
demonstrate surprising adaptability under certain CPT conditions. These nuanced
interactions emphasize the complexity of multilingual representation learning,
underscoring the importance of systematic studies on generalizable language
classification to inform future multilingual CPT strategies.
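To make the compared data-mixing strategies concrete, the following is a minimal sketch, not the authors' released code, of how monolingual, bilingual, and code-augmented CPT corpora might be assembled. All function names, sampling ratios, and placeholder documents are illustrative assumptions; the paper's actual configurations and proportions differ per experiment.

```python
import random

def build_cpt_mixture(strategy, target_docs, english_docs=None, code_docs=None,
                      bilingual_ratio=0.5, code_ratio=0.25, seed=0):
    """Assemble a continual-pretraining corpus under one of three illustrative
    data-mixing strategies: 'monolingual', 'bilingual', or 'code-augmented'.
    Strategy names and ratios are hypothetical, not taken from the paper."""
    rng = random.Random(seed)
    mixture = list(target_docs)  # target-language data is always included

    if strategy == "bilingual":
        # Mix in English documents up to the requested share of the corpus.
        n_en = int(len(mixture) * bilingual_ratio / (1 - bilingual_ratio))
        mixture += rng.sample(english_docs, min(n_en, len(english_docs)))
    elif strategy == "code-augmented":
        # Add programming-code documents as a smaller slice of the corpus.
        n_code = int(len(mixture) * code_ratio / (1 - code_ratio))
        mixture += rng.sample(code_docs, min(n_code, len(code_docs)))
    elif strategy != "monolingual":
        raise ValueError(f"unknown strategy: {strategy}")

    rng.shuffle(mixture)
    return mixture

# Example: a bilingual mixture for a low-resource target language.
corpus = build_cpt_mixture(
    strategy="bilingual",
    target_docs=["<target-language doc 1>", "<target-language doc 2>"],
    english_docs=["<English doc 1>", "<English doc 2>"],
)
```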