Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
April 23, 2025
作者: Luca Moroni, Giovanni Puccetti, Pere-Lluis Huguet Cabot, Andrei Stefan Bejgu, Edoardo Barba, Alessio Miaschi, Felice Dell'Orletta, Andrea Esuli, Roberto Navigli
cs.AI
Abstract
The number of pretrained Large Language Models (LLMs) is increasing steadily,
though the majority are designed predominantly for the English language. While
state-of-the-art LLMs can handle other languages, due to language contamination
or some degree of multilingual pretraining data, they are not optimized for
non-English languages, leading to inefficient encoding (high token "fertility")
and slower inference speed. In this work, we thoroughly compare a variety of
vocabulary adaptation techniques for optimizing English LLMs for the Italian
language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a
novel method that leverages neural mapping for vocabulary substitution. SAVA
achieves competitive performance across multiple downstream tasks, enhancing
grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing
token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and
reducing the number of parameters by 1 billion. We show that, following the
adaptation of the vocabulary, these models can recover their performance with a
relatively limited stage of continual training on the target language. Finally,
we test the capabilities of the adapted models on various multi-choice and
generative tasks.
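
Token "fertility" as used above is commonly measured as the average number of subword tokens a tokenizer produces per word of input text. The following is a minimal sketch of such a measurement, assuming the Hugging Face transformers library; the sample sentence is illustrative and not taken from the paper.

```python
# Minimal sketch: measuring token fertility (average subword tokens per word)
# for a given tokenizer on Italian text. Model name and sample text are
# illustrative assumptions, not taken from the paper.
from transformers import AutoTokenizer

def token_fertility(tokenizer, texts):
    """Average number of tokens the tokenizer produces per whitespace-separated word."""
    total_tokens, total_words = 0, 0
    for text in texts:
        total_tokens += len(tokenizer.encode(text, add_special_tokens=False))
        total_words += len(text.split())
    return total_tokens / total_words

texts = ["Il gatto dorme sul divano tutto il pomeriggio."]  # illustrative sample
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(f"fertility: {token_fertility(tok, texts):.2f}")
```

A lower value means the Italian tokenizer splits words into fewer pieces, which shortens sequences and speeds up inference.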
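The abstract does not spell out how SAVA's neural mapping works, but alignment-based vocabulary substitution is often realized by fitting a map between two embedding spaces on the tokens their vocabularies share, then projecting the new-vocabulary embeddings into the source model's space. The sketch below illustrates that general idea with a least-squares linear map on toy data; the dimensions, the helper embeddings, and the linear form of the mapping are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of alignment-based embedding initialization for a new vocabulary:
# fit a linear map between a helper model's embedding space and the source LLM's
# embedding space on tokens shared by both vocabularies, then project the
# embeddings of the new (e.g., Italian) tokens into the source space.
# Illustrative only; not claimed to be the exact SAVA procedure.
import numpy as np

def fit_linear_map(helper_shared, source_shared):
    """Least-squares map W such that helper_shared @ W approximates source_shared."""
    W, *_ = np.linalg.lstsq(helper_shared, source_shared, rcond=None)
    return W

def init_new_embeddings(helper_new, W):
    """Project helper-space embeddings of new tokens into the source embedding space."""
    return helper_new @ W

# Toy dimensions (assumed): 1000 shared tokens, helper dim 768, source dim 4096.
rng = np.random.default_rng(0)
helper_shared = rng.normal(size=(1000, 768))
source_shared = rng.normal(size=(1000, 4096))
W = fit_linear_map(helper_shared, source_shared)
new_embeddings = init_new_embeddings(rng.normal(size=(50, 768)), W)
print(new_embeddings.shape)  # (50, 4096)
```

Initializing the new token embeddings this way, rather than randomly, is what allows the adapted model to recover its performance with a comparatively short stage of continual training on the target language.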