Transformers without Normalization
March 13, 2025
Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
cs.AI
Abstract
Normalization layers are ubiquitous in modern neural networks and have long
been considered essential. This work demonstrates that Transformers without
normalization can achieve the same or better performance using a remarkably
simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation
DyT(x) = tanh(alpha x), as a drop-in replacement for normalization
layers in Transformers. DyT is inspired by the observation that layer
normalization in Transformers often produces tanh-like, S-shaped input-output
mappings. By incorporating DyT, Transformers without normalization can match or
exceed the performance of their normalized counterparts, mostly without
hyperparameter tuning. We validate the effectiveness of Transformers with DyT
across diverse settings, ranging from recognition to generation, supervised to
self-supervised learning, and computer vision to language models. These
findings challenge the conventional understanding that normalization layers are
indispensable in modern neural networks, and offer new insights into their role
in deep networks.
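The operation described in the abstract is simple enough to sketch in code. Below is a minimal PyTorch sketch of DyT used as a drop-in replacement for a normalization layer; the learnable per-channel weight and bias (mirroring LayerNorm's affine parameters) and the initial value of alpha are assumptions beyond what the abstract specifies.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh, DyT(x) = tanh(alpha * x), per the abstract.

    The affine weight/bias and the alpha initialization below are assumed
    details, not stated in the abstract.
    """

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        # Learnable scalar that scales the input before the tanh squashing.
        self.alpha = nn.Parameter(torch.tensor(init_alpha))
        # Assumed per-channel affine parameters, analogous to LayerNorm's.
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise operation: no means, variances, or other statistics
        # are computed, unlike layer normalization.
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# Usage: swap nn.LayerNorm(dim) for DyT(dim) inside a Transformer block.
x = torch.randn(2, 16, 768)
print(DyT(768)(x).shape)  # torch.Size([2, 16, 768])
```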