Scaling LLM Pre-training with Vocabulary Curriculum
February 25, 2025
Author: Fangyuan Yu
cs.AI
Abstract
Modern language models rely on static vocabularies, fixed before pretraining,
in contrast to the adaptive vocabulary acquisition observed in human language
learning. To bridge this gap, we introduce vocabulary curriculum learning, an
approach that improves pretraining efficiency with log-linear scaling gains
relative to vocabulary size. Our method alternates between entropy-guided
vocabulary expansion and model optimization, enabling models to learn
transferable representations across diverse tokenization granularities. This
approach naturally gives rise to an optimal computation allocation pattern:
longer tokens capture predictable content, while shorter tokens focus on more
complex, harder-to-predict contexts. Experiments on small-scale GPT models
demonstrate improved scaling efficiency, reinforcing the effectiveness of
dynamic tokenization. We release our code to support further research and plan
to extend our experiments to larger models and diverse domains.
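The alternating loop the abstract describes (train, measure per-position predictive entropy, expand the vocabulary where continuations are predictable, retokenize) can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the greedy longest-match tokenizer, the empirical bigram entropy standing in for the trained model's predictive entropy, and the `threshold`/`max_new` parameters are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text, vocab):
    """Greedy longest-match tokenization with the current vocabulary."""
    max_len = max(len(v) for v in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def next_token_entropy(tokens):
    """Empirical entropy (bits) of the next token given the current token --
    a stand-in for the trained model's per-position predictive entropy."""
    follow = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        follow[a][b] += 1
    entropy = {}
    for a, counts in follow.items():
        total = sum(counts.values())
        entropy[a] = -sum(c / total * math.log2(c / total) for c in counts.values())
    return entropy


def expand_vocab(tokens, vocab, entropy, threshold=0.5, max_new=4):
    """Entropy-guided expansion: merge adjacent pairs whose continuation is
    highly predictable (low entropy) into new, longer vocabulary entries."""
    pairs = Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if entropy.get(a, float("inf")) < threshold
    )
    for (a, b), _ in pairs.most_common(max_new):
        vocab.add(a + b)
    return vocab


corpus = "the cat sat on the mat " * 50
vocab = set(corpus)  # start from a character-level vocabulary
for step in range(3):
    tokens = tokenize(corpus, vocab)
    # In the paper's setting, a GPT model would be (re)trained on `tokens` here,
    # and its predictive entropy would replace the empirical estimate below.
    entropy = next_token_entropy(tokens)
    vocab = expand_vocab(tokens, vocab, entropy)
    print(f"step {step}: vocab size = {len(vocab)}, corpus length = {len(tokens)} tokens")
```

Each round grows the vocabulary and shortens the tokenized corpus, which reflects the compute-allocation pattern the abstract describes: longer tokens absorb predictable content, leaving shorter tokens (and more model compute per character) for harder-to-predict contexts.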