You Do Not Fully Utilize Transformer's Representation Capacity
February 13, 2025
Authors: Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
cs.AI
Abstract
In contrast to RNNs, which compress previous tokens into a single hidden
state, Transformers can attend to all previous tokens directly. However,
standard Transformers only use representations from the immediately preceding
layer. In this paper, we show that this design choice causes representation
collapse and leads to suboptimal performance. To address this issue, we
introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that
preserves the model's overall memory footprint while expanding its
representational capacity by allowing access to hidden states from earlier
layers. Through extensive experiments across various architectures and
different lookup mechanisms, we demonstrate consistent performance improvements
on a wide range of tasks. Moreover, our analysis of the learned representation
dynamics and our exploration of depthwise circuits reveal how LIMe integrates
information across layers, pointing to promising directions for future
research.
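This page carries only the abstract, but the mechanism it describes (letting a layer read hidden states from earlier layers rather than only the immediately preceding one, without growing the memory footprint) can be sketched compactly. The following is a minimal, hypothetical Python/PyTorch sketch of one way such a learned depthwise mixture could look; the module and parameter names (LayerMixer, mix_logits) and the per-head softmax routing are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the idea behind Layer-Integrated Memory (LIMe):
# rather than attending only over the previous layer's representations,
# each attention layer first forms a learned per-head mixture of hidden
# states from all earlier layers and attends over that mixture, so the
# memory footprint stays the same. Names such as LayerMixer and
# mix_logits are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class LayerMixer(nn.Module):
    """Learned per-head mixture over the hidden states of earlier layers."""

    def __init__(self, num_prev_layers: int, num_heads: int):
        super().__init__()
        # One mixing logit per (head, earlier layer).
        self.mix_logits = nn.Parameter(torch.zeros(num_heads, num_prev_layers))

    def forward(self, past_hidden: torch.Tensor) -> torch.Tensor:
        # past_hidden: (num_prev_layers, batch, seq_len, dim)
        weights = torch.softmax(self.mix_logits, dim=-1)  # (heads, layers)
        # Per-head mixed states: (heads, batch, seq_len, dim)
        return torch.einsum("hl,lbsd->hbsd", weights, past_hidden)


# Toy usage: keys/values at the current layer would be built from the
# mixed states instead of the previous layer's output alone.
if __name__ == "__main__":
    layers, batch, seq_len, dim, heads = 4, 2, 16, 64, 8
    past = torch.randn(layers, batch, seq_len, dim)
    mixer = LayerMixer(num_prev_layers=layers, num_heads=heads)
    print(mixer(past).shape)  # torch.Size([8, 2, 16, 64])
```

Because attention in this sketch still operates on a single mixed state per head, the key/value memory cost matches a standard Transformer, which is the property the abstract highlights.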