You Do Not Fully Utilize Transformer's Representation Capacity
February 13, 2025
Authors: Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
cs.AI
Abstract
In contrast to RNNs, which compress previous tokens into a single hidden
state, Transformers can attend to all previous tokens directly. However,
standard Transformers only use representations from the immediately preceding
layer. In this paper, we show that this design choice causes representation
collapse and leads to suboptimal performance. To address this issue, we
introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that
preserves the model's overall memory footprint while expanding its
representational capacity by allowing access to hidden states from earlier
layers. Through extensive experiments across various architectures and
different lookup mechanisms, we demonstrate consistent performance improvements
on a wide range of tasks. Moreover, our analysis of the learned representation
dynamics and our exploration of depthwise circuits reveal how LIMe integrates
information across layers, pointing to promising directions for future
research.
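This page carries only the abstract, but the mechanism it describes (letting a layer read hidden states from earlier layers rather than only the immediately preceding one, without growing the memory footprint) can be sketched compactly. The following is a minimal, hypothetical Python/PyTorch sketch of one way such a learned depthwise mixture could look; the module and parameter names (LayerMixer, mix_logits) and the per-head softmax routing are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the idea behind Layer-Integrated Memory (LIMe):
# rather than attending only over the previous layer's representations,
# each attention layer first forms a learned per-head mixture of hidden
# states from all earlier layers and attends over that mixture, so the
# memory footprint stays the same. Names such as LayerMixer and
# mix_logits are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class LayerMixer(nn.Module):
    """Learned per-head mixture over the hidden states of earlier layers."""

    def __init__(self, num_prev_layers: int, num_heads: int):
        super().__init__()
        # One mixing logit per (head, earlier layer).
        self.mix_logits = nn.Parameter(torch.zeros(num_heads, num_prev_layers))

    def forward(self, past_hidden: torch.Tensor) -> torch.Tensor:
        # past_hidden: (num_prev_layers, batch, seq_len, dim)
        weights = torch.softmax(self.mix_logits, dim=-1)  # (heads, layers)
        # Per-head mixed states: (heads, batch, seq_len, dim)
        return torch.einsum("hl,lbsd->hbsd", weights, past_hidden)


# Toy usage: keys/values at the current layer would be built from the
# mixed states instead of the previous layer's output alone.
if __name__ == "__main__":
    layers, batch, seq_len, dim, heads = 4, 2, 16, 64, 8
    past = torch.randn(layers, batch, seq_len, dim)
    mixer = LayerMixer(num_prev_layers=layers, num_heads=heads)
    print(mixer(past).shape)  # torch.Size([8, 2, 16, 64])
```

Because attention in this sketch still operates on a single mixed state per head, the key/value memory cost matches a standard Transformer, which is the property the abstract highlights.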