Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

Abstract

Large Language Models (LLMs) based on transformer architectures have transformed a variety of domains, and tokenization plays a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of the tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We use Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings show that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Other notable insights include GPT-4o's improvement over its predecessor GPT-4 in processing Indian languages and the limited performance of Project Indus in certain languages. This study underscores the importance of developing targeted tokenization strategies for multilingual and Indic-centric models, and lays the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.
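The abstract names NSL but does not define it. One common formulation, assumed here rather than taken from this text, is the ratio of the total number of tokens a tokenizer produces on a set of texts to the total produced by a baseline tokenizer on the same texts; a value below 1 means shorter sequences than the baseline. The sketch below illustrates that ratio in Python. The two toy tokenizers (`whitespace_tokenize`, `char_tokenize`) and the sample strings are hypothetical stand-ins, not the tokenizers or data evaluated in the paper.

```python
from typing import Callable, List, Sequence

# A tokenizer is modelled as any function mapping a string to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def normalized_sequence_length(
    tokenize: Tokenizer,
    baseline_tokenize: Tokenizer,
    texts: Sequence[str],
) -> float:
    """Ratio of total token counts: candidate tokenizer over baseline tokenizer.

    Assumed formulation: sum of candidate sequence lengths divided by the
    sum of baseline sequence lengths over the same texts.
    """
    total = sum(len(tokenize(t)) for t in texts)
    baseline_total = sum(len(baseline_tokenize(t)) for t in texts)
    if baseline_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return total / baseline_total


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical toy tokenizer: split on whitespace."""
    return text.split()


def char_tokenize(text: str) -> List[str]:
    """Hypothetical toy baseline: one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    # Illustrative samples only; not drawn from the paper's evaluation data.
    samples = ["नमस्ते दुनिया", "tokenizers split text"]
    nsl = normalized_sequence_length(whitespace_tokenize, char_tokenize, samples)
    print(f"NSL of whitespace tokenizer vs. character baseline: {nsl:.3f}")
```

With real models, the two callables would be replaced by the `encode` methods of the tokenizers under comparison, and the ratio would be computed separately for each language.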

