
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections

February 13, 2025
Authors: Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan
cs.AI

Abstract

We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on the hidden states at each sequence position and for each decoupled input stream (the query, key, value, or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8x-2.4x compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining perplexity (ppl) and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer.
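To make the mechanism concrete, below is a minimal PyTorch sketch of a multiway dynamic dense connection as described in the abstract: for each of the four decoupled input streams of a block, per-position mixing weights over all earlier layer outputs are generated from the current hidden state. The class name `MUDDConnection`, the single-linear weight generator, and the plain weighted sum are illustrative assumptions, not the paper's exact formulation; the official JAX and PyTorch implementations are in the linked repository.

```python
import torch
import torch.nn as nn


class MUDDConnection(nn.Module):
    """Hypothetical sketch of a multiway dynamic dense connection.

    For block i, it mixes the hidden states of all preceding layers
    (and the current one) into four separate inputs, one per decoupled
    stream (query, key, value, residual), with mixing weights generated
    dynamically from the current hidden state at every position.
    """

    def __init__(self, d_model: int, depth: int, n_streams: int = 4):
        super().__init__()
        # One small projection per stream maps each position's hidden
        # state to `depth` mixing weights (one per earlier layer).
        self.weight_gen = nn.ModuleList(
            [nn.Linear(d_model, depth) for _ in range(n_streams)]
        )

    def forward(self, xs: torch.Tensor) -> list[torch.Tensor]:
        # xs: (depth, batch, seq, d_model) -- hidden states of all
        # layers up to and including the current block's input.
        cur = xs[-1]  # the latest hidden state drives the weights
        outs = []
        for gen in self.weight_gen:
            w = gen(cur)  # (batch, seq, depth): per-position weights
            # Dynamic weighted sum over the layer axis.
            mixed = torch.einsum('lbsd,bsl->bsd', xs, w)
            outs.append(mixed)
        return outs  # inputs for the Q, K, V and residual streams


# Usage sketch: 5 stored layer outputs, batch 2, seq 16, d_model 64.
xs = torch.randn(5, 2, 16, 64)
mudd = MUDDConnection(d_model=64, depth=5)
q_in, k_in, v_in, r_in = mudd(xs)
```

Note the contrast with earlier dense-connection designs: there the layer-mixing weights are static parameters shared across positions, whereas here they are recomputed per token and per stream. Each generator costs only a d_model-by-depth linear map, which is consistent with the abstract's claim of roughly 0.23% extra parameters and 0.4% extra computation.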
