

Scaling Embedding Layers in Language Models

February 3, 2025
Authors: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
cs.AI

Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
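The abstract describes n-gram embeddings that are precomputed by a separate model and fetched from off-accelerator memory at inference time. The sketch below is a minimal illustration of that lookup idea under stated assumptions, not the paper's implementation: the table layout, the longest-match rule, and the additive combination with the base token embedding are simplifications, and names such as contextualized_input, ngram_embeddings, and MAX_NGRAM are hypothetical.

```python
# Minimal sketch of a SCONE-style inference-time lookup (illustrative only).
import numpy as np

EMBED_DIM = 8   # toy embedding width (assumption)
MAX_NGRAM = 3   # longest cached n-gram (assumption)

# Base token embedding table (kept on the accelerator in practice).
token_embeddings = {
    tok: np.random.randn(EMBED_DIM) for tok in ["the", "cat", "sat", "on", "mat"]
}

# Precomputed embeddings for a set of frequent n-grams, produced offline by a
# separate embedding model and offloaded to host memory / disk at inference time.
ngram_embeddings = {
    ("the", "cat"): np.random.randn(EMBED_DIM),
    ("cat", "sat"): np.random.randn(EMBED_DIM),
    ("the", "cat", "sat"): np.random.randn(EMBED_DIM),
}

def contextualized_input(tokens):
    """Return per-token input vectors: the base embedding plus the embedding of
    the longest cached n-gram ending at that token, if one exists."""
    outputs = []
    for i, tok in enumerate(tokens):
        vec = token_embeddings[tok].copy()
        # Try the longest n-gram ending at position i first.
        for n in range(min(MAX_NGRAM, i + 1), 1, -1):
            ngram = tuple(tokens[i - n + 1 : i + 1])
            if ngram in ngram_embeddings:
                vec += ngram_embeddings[ngram]  # table lookup plus one add
                break
        outputs.append(vec)
    return np.stack(outputs)

print(contextualized_input(["the", "cat", "sat", "on", "the", "mat"]).shape)  # (6, 8)
```

Because the augmentation is a table lookup rather than extra matrix multiplies, the cached vectors can plausibly live off-accelerator with little latency cost while the accelerator-side compute per token stays essentially unchanged, which is consistent with the abstract's fixed inference-time FLOPS claim.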

