Value Residual Learning For Alleviating Attention Concentration In Transformers

Abstract

Transformers can capture long-range dependencies using self-attention, which lets each token attend directly to every other token. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is cross-layer attention, which makes information from earlier layers directly accessible to later layers. However, this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), in which all layers share the same value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as on downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods such as GQA and CLA, with its performance influenced by sequence length and cumulative learning rate.
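To make the value residual idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes single-head attention, plain Python lists instead of a tensor library, and a hypothetical mixing weight `mix` (the abstract does not say how the first-layer values are combined with the current ones, so the equal blend here is an assumption). The names `attention_layer`, `forward`, and `first_values` are illustrative only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights, for demonstration only."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def attention_layer(x, w_q, w_k, w_v, first_values=None, mix=0.5):
    """Single-head self-attention with an optional value residual.

    first_values: values produced by the first layer (None inside layer 1).
    mix: hypothetical weight on this layer's own values (an assumption).
    Returns (attention output, values actually used).
    """
    d = len(w_q[0])
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    if first_values is not None:
        # Value residual: blend this layer's values with the first layer's.
        v = [[mix * a + (1.0 - mix) * b for a, b in zip(row_v, row_1)]
             for row_v, row_1 in zip(v, first_values)]
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), v


def forward(x, layers, mix=0.5):
    """Run a stack of attention layers that all reuse the layer-1 values."""
    first_values = None
    for w_q, w_k, w_v in layers:
        out, v = attention_layer(x, w_q, w_k, w_v, first_values, mix)
        if first_values is None:
            first_values = v  # keep layer-1 values for every later layer
        # Usual hidden-state residual connection around the attention block.
        x = [[a + b for a, b in zip(row_x, row_o)] for row_x, row_o in zip(x, out)]
    return x


rng = random.Random(0)
n_tokens, d_model, n_layers = 4, 8, 3
x = random_matrix(n_tokens, d_model, rng)
layers = [tuple(random_matrix(d_model, d_model, rng) for _ in range(3))
          for _ in range(n_layers)]
y = forward(x, layers)
print(len(y), len(y[0]))  # 4 8
```

In this sketch, `mix=1.0` reduces every layer to ordinary self-attention, while `mix=0.0` makes every later layer use only the first layer's values, which loosely mirrors the SVFormer idea of sharing a single value embedding. A real SVFormer would skip computing per-layer values altogether, which is where the KV cache saving comes from.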
