

Scaling Embedding Layers in Language Models

February 3, 2025
Authors: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
cs.AI

Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
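The abstract describes n-gram embeddings that are precomputed by a separate model and fetched from off-accelerator memory at inference time. The sketch below is a minimal illustration of that lookup idea under stated assumptions, not the paper's implementation: the table layout, the longest-match rule, and the additive combination with the base token embedding are simplifications, and names such as contextualized_input, ngram_embeddings, and MAX_NGRAM are hypothetical.

```python
# Minimal sketch of a SCONE-style inference-time lookup (illustrative only).
import numpy as np

EMBED_DIM = 8   # toy embedding width (assumption)
MAX_NGRAM = 3   # longest cached n-gram (assumption)

# Base token embedding table (kept on the accelerator in practice).
token_embeddings = {
    tok: np.random.randn(EMBED_DIM) for tok in ["the", "cat", "sat", "on", "mat"]
}

# Precomputed embeddings for a set of frequent n-grams, produced offline by a
# separate embedding model and offloaded to host memory / disk at inference time.
ngram_embeddings = {
    ("the", "cat"): np.random.randn(EMBED_DIM),
    ("cat", "sat"): np.random.randn(EMBED_DIM),
    ("the", "cat", "sat"): np.random.randn(EMBED_DIM),
}

def contextualized_input(tokens):
    """Return per-token input vectors: the base embedding plus the embedding of
    the longest cached n-gram ending at that token, if one exists."""
    outputs = []
    for i, tok in enumerate(tokens):
        vec = token_embeddings[tok].copy()
        # Try the longest n-gram ending at position i first.
        for n in range(min(MAX_NGRAM, i + 1), 1, -1):
            ngram = tuple(tokens[i - n + 1 : i + 1])
            if ngram in ngram_embeddings:
                vec += ngram_embeddings[ngram]  # table lookup plus one add
                break
        outputs.append(vec)
    return np.stack(outputs)

print(contextualized_input(["the", "cat", "sat", "on", "the", "mat"]).shape)  # (6, 8)
```

Because the augmentation is a table lookup rather than extra matrix multiplies, the cached vectors can plausibly live off-accelerator with little latency cost while the accelerator-side compute per token stays essentially unchanged, which is consistent with the abstract's fixed inference-time FLOPS claim.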

