Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
November 19, 2024
Authors: S. Tamang, D. J. Bora
cs.AI
Abstract
Large Language Models (LLMs) based on transformer architectures have
revolutionized a variety of domains, with tokenization playing a pivotal role
in their pre-processing and fine-tuning stages. In multilingual models,
particularly those tailored for Indic languages, effective tokenization is
crucial for optimizing performance. This paper presents a comprehensive
evaluation of tokenizers used by 12 LLMs across all 22 official languages of
India, with a focus on comparing the efficiency of their tokenization
processes. We employed the Normalized Sequence Length (NSL) as a key metric in
our analysis. Our findings reveal that the SUTRA tokenizer outperforms all
other models, including several Indic-specific models, excelling in 14
languages. Notable insights include the SUTRA tokenizer's superior handling of
Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing
Indian languages, and the limited performance of Project Indus in certain
languages. This study underscores the critical importance of developing
targeted tokenization strategies for multilingual and Indic-centric models,
laying the groundwork for future improvements in tokenizer design to enhance
linguistic coverage and model efficiency.
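To make the NSL metric concrete, below is a minimal Python sketch assuming the common formulation: the ratio of a candidate tokenizer's token count to a baseline tokenizer's token count, averaged over a corpus. The model names and sample sentences are illustrative placeholders, not the paper's actual evaluation setup.

```python
# Minimal sketch of Normalized Sequence Length (NSL), assuming it is the
# average ratio of a candidate tokenizer's sequence length to a baseline
# tokenizer's sequence length over the same texts. Lower is better for the
# candidate. Model names and sample texts below are illustrative only.
from transformers import AutoTokenizer

def normalized_sequence_length(candidate, baseline, corpus):
    """Average token-count ratio: candidate tokenizer vs. baseline tokenizer."""
    ratios = []
    for text in corpus:
        candidate_len = len(candidate.encode(text, add_special_tokens=False))
        baseline_len = len(baseline.encode(text, add_special_tokens=False))
        if baseline_len > 0:
            ratios.append(candidate_len / baseline_len)
    return sum(ratios) / len(ratios)

if __name__ == "__main__":
    # Hypothetical comparison on a tiny Hindi sample.
    candidate = AutoTokenizer.from_pretrained("gpt2")  # stand-in candidate
    baseline = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    sample = ["भारत एक विशाल देश है।", "यह एक परीक्षण वाक्य है।"]
    print(f"NSL: {normalized_sequence_length(candidate, baseline, sample):.3f}")
```

An NSL below 1.0 would indicate that the candidate tokenizer segments the language into fewer tokens than the baseline, which typically translates into lower inference cost and longer effective context for that language.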