Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
April 23, 2025
作者: Luca Moroni, Giovanni Puccetti, Pere-Lluis Huguet Cabot, Andrei Stefan Bejgu, Edoardo Barba, Alessio Miaschi, Felice Dell'Orletta, Andrea Esuli, Roberto Navigli
cs.AI
Abstract
The number of pretrained Large Language Models (LLMs) is increasing steadily,
though the majority are designed predominantly for the English language. While
state-of-the-art LLMs can handle other languages, due to language contamination
or some degree of multilingual pretraining data, they are not optimized for
non-English languages, leading to inefficient encoding (high token "fertility")
and slower inference speed. In this work, we thoroughly compare a variety of
vocabulary adaptation techniques for optimizing English LLMs for the Italian
language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a
novel method that leverages neural mapping for vocabulary substitution. SAVA
achieves competitive performance across multiple downstream tasks, enhancing
grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing
token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and
reducing the number of parameters by 1 billion. We show that, following the
adaptation of the vocabulary, these models can recover their performance with a
relatively limited stage of continual training on the target language. Finally,
we test the capabilities of the adapted models on various multi-choice and
generative tasks.
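
Token "fertility" as used above is commonly measured as the average number of subword tokens a tokenizer produces per word of input text. The following is a minimal sketch of such a measurement, assuming the Hugging Face transformers library; the sample sentence is illustrative and not taken from the paper.

```python
# Minimal sketch: measuring token fertility (average subword tokens per word)
# for a given tokenizer on Italian text. Model name and sample text are
# illustrative assumptions, not taken from the paper.
from transformers import AutoTokenizer

def token_fertility(tokenizer, texts):
    """Average number of tokens the tokenizer produces per whitespace-separated word."""
    total_tokens, total_words = 0, 0
    for text in texts:
        total_tokens += len(tokenizer.encode(text, add_special_tokens=False))
        total_words += len(text.split())
    return total_tokens / total_words

texts = ["Il gatto dorme sul divano tutto il pomeriggio."]  # illustrative sample
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(f"fertility: {token_fertility(tok, texts):.2f}")
```

A lower value means the Italian tokenizer splits words into fewer pieces, which shortens sequences and speeds up inference.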
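The abstract does not spell out how SAVA's neural mapping works, but alignment-based vocabulary substitution is often realized by fitting a map between two embedding spaces on the tokens their vocabularies share, then projecting the new-vocabulary embeddings into the source model's space. The sketch below illustrates that general idea with a least-squares linear map on toy data; the dimensions, the helper embeddings, and the linear form of the mapping are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of alignment-based embedding initialization for a new vocabulary:
# fit a linear map between a helper model's embedding space and the source LLM's
# embedding space on tokens shared by both vocabularies, then project the
# embeddings of the new (e.g., Italian) tokens into the source space.
# Illustrative only; not claimed to be the exact SAVA procedure.
import numpy as np

def fit_linear_map(helper_shared, source_shared):
    """Least-squares map W such that helper_shared @ W approximates source_shared."""
    W, *_ = np.linalg.lstsq(helper_shared, source_shared, rcond=None)
    return W

def init_new_embeddings(helper_new, W):
    """Project helper-space embeddings of new tokens into the source embedding space."""
    return helper_new @ W

# Toy dimensions (assumed): 1000 shared tokens, helper dim 768, source dim 4096.
rng = np.random.default_rng(0)
helper_shared = rng.normal(size=(1000, 768))
source_shared = rng.normal(size=(1000, 4096))
W = fit_linear_map(helper_shared, source_shared)
new_embeddings = init_new_embeddings(rng.normal(size=(50, 768)), W)
print(new_embeddings.shape)  # (50, 4096)
```

Initializing the new token embeddings this way, rather than randomly, is what allows the adapted model to recover its performance with a comparatively short stage of continual training on the target language.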