Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
January 28, 2025
Authors: Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou
cs.AI
Abstract
Tokenization is a fundamental component of large language models (LLMs), yet
its influence on model scaling and performance is not fully explored. In this
paper, we introduce Over-Tokenized Transformers, a novel framework that
decouples input and output vocabularies to improve language modeling
performance. Specifically, our approach scales up input vocabularies to
leverage multi-gram tokens. Through extensive experiments, we uncover a
log-linear relationship between input vocabulary size and training loss,
demonstrating that larger input vocabularies consistently enhance model
performance, regardless of model size. Using a large input vocabulary, we
achieve performance comparable to double-sized baselines with no additional
cost. Our findings highlight the importance of tokenization in scaling laws and
provide practical insight for tokenizer design, paving the way for more
efficient and powerful LLMs.
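The abstract describes the core idea (a decoupled, enlarged input vocabulary built from multi-gram tokens, with the output vocabulary left unchanged) but not the exact parameterization. The snippet below is only a minimal sketch of what such a decoupled input embedding could look like; the class name, the bigram hashing scheme, and the bucket count are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class OverTokenizedInputEmbedding(nn.Module):
    """Sketch: enlarge the *input* side with hashed bigram tokens while the
    output head keeps the base vocabulary, so decoding cost is unchanged."""

    def __init__(self, base_vocab: int, ngram_buckets: int, d_model: int):
        super().__init__()
        self.unigram = nn.Embedding(base_vocab, d_model)    # standard input table
        self.bigram = nn.Embedding(ngram_buckets, d_model)  # enlarged n-gram table
        self.ngram_buckets = ngram_buckets

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq_len) token ids from the base tokenizer
        h = self.unigram(ids)
        # Pair each token with its predecessor and hash the pair into a
        # fixed number of buckets (hypothetical hash, not from the paper).
        prev = torch.roll(ids, shifts=1, dims=1)
        prev[:, 0] = 0  # pad the first position
        bigram_ids = (prev * 1_000_003 + ids) % self.ngram_buckets
        return h + self.bigram(bigram_ids)
```

Because only the input embedding table grows, the softmax over the output vocabulary (and therefore per-step decoding cost) is the same as the baseline, which is consistent with the abstract's claim of gains "with no additional cost."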