Transformers without Normalization
March 13, 2025
Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
cs.AI
Abstract
Normalization layers are ubiquitous in modern neural networks and have long
been considered essential. This work demonstrates that Transformers without
normalization can achieve the same or better performance using a remarkably
simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation
DyT(x) = tanh(alpha x), as a drop-in replacement for normalization
layers in Transformers. DyT is inspired by the observation that layer
normalization in Transformers often produces tanh-like, S-shaped input-output
mappings. By incorporating DyT, Transformers without normalization can match or
exceed the performance of their normalized counterparts, mostly without
hyperparameter tuning. We validate the effectiveness of Transformers with DyT
across diverse settings, ranging from recognition to generation, supervised to
self-supervised learning, and computer vision to language models. These
findings challenge the conventional understanding that normalization layers are
indispensable in modern neural networks, and offer new insights into their role
in deep networks.
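The operation described in the abstract is simple enough to sketch in code. Below is a minimal PyTorch sketch of DyT used as a drop-in replacement for a normalization layer; the learnable per-channel weight and bias (mirroring LayerNorm's affine parameters) and the initial value of alpha are assumptions beyond what the abstract specifies.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh, DyT(x) = tanh(alpha * x), per the abstract.

    The affine weight/bias and the alpha initialization below are assumed
    details, not stated in the abstract.
    """

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        # Learnable scalar that scales the input before the tanh squashing.
        self.alpha = nn.Parameter(torch.tensor(init_alpha))
        # Assumed per-channel affine parameters, analogous to LayerNorm's.
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise operation: no means, variances, or other statistics
        # are computed, unlike layer normalization.
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# Usage: swap nn.LayerNorm(dim) for DyT(dim) inside a Transformer block.
x = torch.randn(2, 16, 768)
print(DyT(768)(x).shape)  # torch.Size([2, 16, 768])
```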