Value Residual Learning For Alleviating Attention Concentration In Transformers

October 23, 2024
Authors: Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan
cs.AI

Abstract

Transformers can capture long-range dependencies using self-attention, allowing tokens to attend to all others directly. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is cross-layer attention, which makes information from earlier layers directly accessible to later layers, but this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Building on this method, one variant is the Transformer with single-layer value (SVFormer), in which all layers share the same value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representations across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as on downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods such as GQA and CLA, with its performance influenced by sequence length and cumulative learning rate.
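
To make the mechanism in the abstract concrete, below is a minimal single-head PyTorch sketch of the value-residual idea. The class name ValueResidualAttention, the unweighted sum v + first_layer_value, and the omission of masking, multi-head splitting, and the feed-forward/normalization sub-layers are simplifying assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ValueResidualAttention(nn.Module):
    """Single-head self-attention with a residual from the first layer's values."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, first_layer_value=None):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        if first_layer_value is not None:
            # ResFormer-style value residual: add the first layer's values to
            # this layer's values before attention (unweighted sum assumed).
            v = v + first_layer_value
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        out = F.softmax(scores, dim=-1) @ v
        # The first layer returns its own values so later layers can reuse them.
        return out, v if first_layer_value is None else first_layer_value


# Toy usage: layer 1 caches its values V_1; every later layer adds V_1 to its own values.
layers = nn.ModuleList(ValueResidualAttention(64) for _ in range(4))
x = torch.randn(2, 16, 64)  # (batch, seq_len, d_model)
v1 = None
for layer in layers:
    x, v1 = layer(x, first_layer_value=v1)
print(x.shape)  # torch.Size([2, 16, 64])
```

Under the same assumptions, the SVFormer variant would drop the per-layer value projection in later layers and feed V_1 directly as their values, which is what allows it to shed roughly half of the KV cache.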
