Scaling LLM Pre-training with Vocabulary Curriculum
February 25, 2025
Author: Fangyuan Yu
cs.AI
Abstract
Modern language models rely on static vocabularies, fixed before pretraining,
in contrast to the adaptive vocabulary acquisition observed in human language
learning. To bridge this gap, we introduce vocabulary curriculum learning, an
approach that improves pretraining efficiency with log-linear scaling gains
relative to vocabulary size. Our method alternates between entropy-guided
vocabulary expansion and model optimization, enabling models to learn
transferable representations across diverse tokenization granularities. This
approach naturally gives rise to an optimal computation allocation pattern:
longer tokens capture predictable content, while shorter tokens focus on more
complex, harder-to-predict contexts. Experiments on small-scale GPT models
demonstrate improved scaling efficiency, reinforcing the effectiveness of
dynamic tokenization. We release our code to support further research and plan
to extend our experiments to larger models and diverse domains.
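The alternating loop the abstract describes (train, measure per-position predictive entropy, expand the vocabulary where continuations are predictable, retokenize) can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the greedy longest-match tokenizer, the empirical bigram entropy standing in for the trained model's predictive entropy, and the `threshold`/`max_new` parameters are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text, vocab):
    """Greedy longest-match tokenization with the current vocabulary."""
    max_len = max(len(v) for v in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def next_token_entropy(tokens):
    """Empirical entropy (bits) of the next token given the current token --
    a stand-in for the trained model's per-position predictive entropy."""
    follow = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        follow[a][b] += 1
    entropy = {}
    for a, counts in follow.items():
        total = sum(counts.values())
        entropy[a] = -sum(c / total * math.log2(c / total) for c in counts.values())
    return entropy


def expand_vocab(tokens, vocab, entropy, threshold=0.5, max_new=4):
    """Entropy-guided expansion: merge adjacent pairs whose continuation is
    highly predictable (low entropy) into new, longer vocabulary entries."""
    pairs = Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if entropy.get(a, float("inf")) < threshold
    )
    for (a, b), _ in pairs.most_common(max_new):
        vocab.add(a + b)
    return vocab


corpus = "the cat sat on the mat " * 50
vocab = set(corpus)  # start from a character-level vocabulary
for step in range(3):
    tokens = tokenize(corpus, vocab)
    # In the paper's setting, a GPT model would be (re)trained on `tokens` here,
    # and its predictive entropy would replace the empirical estimate below.
    entropy = next_token_entropy(tokens)
    vocab = expand_vocab(tokens, vocab, entropy)
    print(f"step {step}: vocab size = {len(vocab)}, corpus length = {len(tokens)} tokens")
```

Each round grows the vocabulary and shortens the tokenized corpus, which reflects the compute-allocation pattern the abstract describes: longer tokens absorb predictable content, leaving shorter tokens (and more model compute per character) for harder-to-predict contexts.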