Value Residual Learning For Alleviating Attention Concentration In Transformers

Abstract

Transformers can capture long-range dependencies using self-attention, which lets each token attend directly to all others. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is cross-layer attention, which makes information from earlier layers directly accessible to later layers, but this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single-layer value (SVFormer), in which all layers share the value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as on downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods such as GQA and CLA, with performance influenced by sequence length and cumulative learning rate.
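The value residual idea can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it assumes a single attention head, no causal mask, no normalization or feed-forward blocks, and an unweighted sum of the current and first-layer values (the abstract does not state how the two are weighted). The helper names (`forward`, `kv_cache_tensors`, `random_matrix`) are hypothetical and exist only for illustration.

```python
import math
import random


def matmul(a, b):
    # Plain list-of-lists matrix product a @ b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    # Element-wise sum of two equally shaped matrices.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    # Scaled dot-product attention (assumption: no causal mask, one head).
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def forward(x, layers, mode="vanilla"):
    # layers: list of dicts with hypothetical projection weights "wq", "wk", "wv".
    h = x
    v_first = None
    for i, w in enumerate(layers):
        q = matmul(h, w["wq"])
        k = matmul(h, w["wk"])
        if i == 0:
            v = v_first = matmul(h, w["wv"])  # first-layer values, reused later
        elif mode == "svformer":
            v = v_first  # every layer shares the first layer's values
        else:
            v = matmul(h, w["wv"])
            if mode == "resformer":
                # Value residual; the unweighted sum is an assumption.
                v = add(v, v_first)
        h = add(h, attention(q, k, v))  # usual hidden-state residual
    return h


def kv_cache_tensors(num_layers, mode="vanilla"):
    # Per-token cached tensors: one K and one V per layer for the vanilla
    # and ResFormer cases, one K per layer plus a single shared V for SVFormer.
    if mode == "svformer":
        return num_layers + 1
    return 2 * num_layers
```

Under these assumptions, a 32-layer model caches 33 tensors per token in the SVFormer setting instead of 64, a reduction of 31/64, or roughly 48%. That matches the "nearly 50%" figure, and the saving approaches one half as depth grows.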
