Cottention: Linear Transformers With Cosine Attention
September 27, 2024
Authors: Gabriel Mongaras, Trevor Dohm, Eric C. Larson
cs.AI
Abstract
Attention mechanisms, particularly softmax attention, have been instrumental
in the success of transformer-based models such as GPT. However, the quadratic
memory complexity of softmax attention with respect to sequence length poses
significant challenges for processing longer sequences. We introduce
Cottention, a novel attention mechanism that replaces the softmax operation
with cosine similarity. By leveraging the properties of cosine similarity and
rearranging the attention equation, Cottention achieves native linear memory
complexity with respect to sequence length, making it inherently more
memory-efficient than softmax attention. We demonstrate that Cottention can be
reformulated as a recurrent neural network (RNN) with a finite hidden state,
allowing for constant memory usage during inference. We evaluate Cottention on
both the bidirectional BERT and causal GPT tasks, demonstrating comparable
performance to softmax attention while significantly reducing memory
requirements. To ensure efficient computation, we develop a custom CUDA kernel
for Cottention. Our results show that Cottention is a promising alternative to
softmax attention, enabling the processing of longer sequences without
sacrificing performance, due to its native linear memory complexity and ability
to maintain a constant memory footprint during inference.
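The rearrangement and the RNN reformulation described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that "cosine similarity" means L2-normalizing each query and key row before taking dot products, and it leaves out any extra scaling or stabilization terms, which the abstract does not specify. The function names are illustrative and are not taken from the authors' code; the custom CUDA kernel is not reproduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine_attention_quadratic(q, k, v):
    """Reference form: materializes the full (n, n) similarity matrix."""
    scores = l2_normalize(q) @ l2_normalize(k).T   # (n, n)
    return scores @ v                               # (n, d_v)

def cosine_attention_linear(q, k, v):
    """Same result by associativity: (Q K^T) V == Q (K^T V)."""
    kv = l2_normalize(k).T @ v                      # (d, d_v), size independent of n
    return l2_normalize(q) @ kv                     # (n, d_v)

def cosine_attention_causal_quadratic(q, k, v):
    """Causal reference: zero out similarities to future positions."""
    scores = np.tril(l2_normalize(q) @ l2_normalize(k).T)
    return scores @ v

def cosine_attention_recurrent(q, k, v):
    """Causal form as an RNN: the hidden state is a (d, d_v) running sum."""
    qn, kn = l2_normalize(q), l2_normalize(k)
    state = np.zeros((q.shape[1], v.shape[1]))      # fixed size for any sequence length
    out = np.empty((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        state += np.outer(kn[t], v[t])              # accumulate k_t v_t^T
        out[t] = qn[t] @ state                      # read out with the current query
    return out
```

Because there is no softmax between the similarity scores and the values, the matrix products can be regrouped freely. The quadratic forms hold an (n, n) intermediate, while the linear and recurrent forms only hold a (d, d_v) matrix, which is why the state stays the same size as the sequence grows. On random inputs, the linear form should match the quadratic one, and the recurrent form should match the causally masked one, up to floating-point error.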