Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
November 19, 2024
Authors: S. Tamang, D. J. Bora
cs.AI
Abstract
Large Language Models (LLMs) based on transformer architectures have
revolutionized a variety of domains, with tokenization playing a pivotal role
in their pre-processing and fine-tuning stages. In multilingual models,
particularly those tailored for Indic languages, effective tokenization is
crucial for optimizing performance. This paper presents a comprehensive
evaluation of tokenizers used by 12 LLMs across all 22 official languages of
India, with a focus on comparing the efficiency of their tokenization
processes. We employed the Normalized Sequence Length (NSL) as a key metric in
our analysis. Our findings reveal that the SUTRA tokenizer outperforms all
the other tokenizers evaluated, including those of several Indic-specific
models, performing best in 14 languages. Notable insights include the SUTRA
tokenizer's superior handling of Indic languages, GPT-4o's improvement over
its predecessor GPT-4 in processing Indian languages, and the limited
performance of Project Indus in certain languages. This study underscores
the critical importance of developing
targeted tokenization strategies for multilingual and Indic-centric models,
laying the groundwork for future improvements in tokenizer design to enhance
linguistic coverage and model efficiency.
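As an illustration of the NSL metric the abstract mentions, the sketch below
assumes the common definition: the ratio of a candidate tokenizer's token
count to a baseline tokenizer's token count on the same text, averaged over a
corpus (lower is better). The exact formulation, corpus, and tokenizer list
in the paper may differ; the model IDs and Hindi sample here are illustrative
stand-ins, not the authors' setup.

```python
# Minimal sketch of Normalized Sequence Length (NSL), assuming it is the
# average ratio of candidate-to-baseline token counts over a corpus.
# Model IDs and the sample corpus are hypothetical placeholders.
from transformers import AutoTokenizer

def nsl(candidate_name: str, baseline_name: str, corpus: list[str]) -> float:
    """Average ratio of candidate to baseline token counts; a value
    below 1.0 means the candidate encodes the text with fewer tokens."""
    candidate = AutoTokenizer.from_pretrained(candidate_name)
    baseline = AutoTokenizer.from_pretrained(baseline_name)
    ratios = []
    for text in corpus:
        cand_len = len(candidate.encode(text, add_special_tokens=False))
        base_len = len(baseline.encode(text, add_special_tokens=False))
        if base_len > 0:  # skip texts the baseline maps to zero tokens
            ratios.append(cand_len / base_len)
    return sum(ratios) / len(ratios)

if __name__ == "__main__":
    # Hypothetical usage on a tiny Hindi sample; in the paper's setting
    # this would run over text in each of the 22 official languages.
    sample = ["भारत एक विशाल देश है।", "मुझे हिंदी में पढ़ना पसंद है।"]
    print(nsl("openai-community/gpt2", "google/mt5-small", sample))
```

Averaging per-text ratios (rather than dividing corpus-level totals) is one
reasonable choice; either aggregation conveys the same idea of comparing
tokenization efficiency against a fixed baseline tokenizer.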