Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Abstract

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
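
To make the idea of a scaled-up, multi-gram input vocabulary concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each position's n-gram id is built from the current token and its predecessors, that the id is folded into a fixed-size table by modulo, and that the input embedding is the sum of the 1-gram and 2-gram rows. The function names, the pad id of 0, and the modulo folding are all illustrative assumptions. The output side is untouched: predictions would still range over the base vocabulary, which is what decoupling the two vocabularies means here.

```python
from typing import List


def ngram_input_ids(token_ids: List[int], base_vocab: int, n: int,
                    table_size: int) -> List[int]:
    """Build one n-gram id per position from the current token and the
    n-1 tokens before it, then fold it into `table_size` rows by modulo.

    Assumption (illustrative): positions before the sequence start use id 0.
    """
    ids = []
    for i in range(len(token_ids)):
        gram = 0
        for j in range(i - n + 1, i + 1):
            tok = token_ids[j] if j >= 0 else 0  # assumed pad id
            gram = gram * base_vocab + tok       # base-`base_vocab` encoding
        ids.append(gram % table_size)            # hypothetical folding step
    return ids


def embed_inputs(token_ids: List[int],
                 unigram_table: List[List[float]],
                 bigram_table: List[List[float]],
                 base_vocab: int) -> List[List[float]]:
    """Input embedding = 1-gram row + 2-gram row (element-wise sum).

    Only the input side sees the enlarged vocabulary; an output head
    would still score the `base_vocab` ordinary tokens.
    """
    bigram_ids = ngram_input_ids(token_ids, base_vocab, 2, len(bigram_table))
    return [
        [u + b for u, b in zip(unigram_table[t], bigram_table[g])]
        for t, g in zip(token_ids, bigram_ids)
    ]


# Toy usage: base vocabulary of 4 tokens, 2-gram table with 16 rows.
unigram_table = [[i, i] for i in range(4)]
bigram_table = [[10 * k, 10 * k] for k in range(16)]
vectors = embed_inputs([1, 2, 3], unigram_table, bigram_table, base_vocab=4)
```

In this toy setup the 2-gram table has 4 × 4 = 16 rows, so every 2-gram gets its own row. With a smaller table, distinct 2-grams would share rows, which is the trade-off a real design would have to weigh.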
