Scalable-Softmax Is Superior for Attention
January 31, 2025
Author: Ken M. Nakanishi
cs.AI
Abstract
The maximum element of the vector output by the Softmax function approaches
zero as the input vector size increases. Transformer-based language models rely
on Softmax to compute attention scores, causing the attention distribution to
flatten as the context size grows. This reduces the model's ability to
prioritize key information effectively and potentially limits its length
generalization. To address this problem, we propose Scalable-Softmax (SSMax),
which replaces Softmax in scenarios where the input vector size varies. SSMax
can be seamlessly integrated into existing Transformer-based architectures.
Experimental results in language modeling show that models using SSMax not only
achieve faster loss reduction during pretraining but also significantly improve
performance in long contexts and key information retrieval. Furthermore, an
analysis of attention scores reveals that SSMax enables the model to focus
attention on key information even in long contexts. Additionally, although
models that use SSMax from the beginning of pretraining achieve better length
generalization, those that have already started pretraining can still gain some
of this ability by replacing Softmax in the attention layers with SSMax, either
during or after pretraining.
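The abstract describes the mechanism only at a high level: Softmax flattens as the input vector grows, and SSMax is a drop-in replacement that stays sharp. The snippet below is a minimal sketch, not the paper's reference implementation. It assumes that a scale-aware Softmax can be obtained by multiplying the logits by a factor proportional to the logarithm of the input length; the function name ssmax and the scalar s are hypothetical placeholders (in a real model such a scale would be learned per attention head), and the value of s here is arbitrary and chosen only for illustration.

# Sketch contrasting plain Softmax with a scale-aware variant (assumption:
# logits are rescaled by s * log(n) before normalizing, n = input size).
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Standard Softmax with max-subtraction for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ssmax(z: np.ndarray, s: float = 0.43) -> np.ndarray:
    """Scale-aware Softmax sketch: multiply logits by s * log(n) so the
    sharpness of the output does not collapse as n grows. Both the name
    and the default value of s are illustrative assumptions."""
    n = z.shape[-1]
    return softmax(s * np.log(n) * z)

rng = np.random.default_rng(0)
for n in (64, 1024, 16384):
    # One "key" logit stands out from the noise by a fixed margin.
    z = rng.normal(size=n)
    z[0] = z.max() + 2.0
    print(f"n={n:6d}  softmax max={softmax(z).max():.4f}  "
          f"ssmax max={ssmax(z).max():.4f}")

With a fixed margin between the key logit and the rest, the maximum of plain Softmax decays toward zero as n increases, while the log-scaled variant keeps most of the probability mass on the key position, which mirrors the flattening behavior the abstract attributes to standard attention and the concentration it attributes to SSMax.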