Kolmogorov-Arnold Transformer
September 16, 2024
Authors: Xingyi Yang, Xinchao Wang
cs.AI
Abstract
Transformers stand as the cornerstone of modern deep learning.
Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix
the information between channels. In this paper, we introduce the
Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP
layers with Kolmogorov-Arnold Network (KAN) layers to enhance the
expressiveness and performance of the model. Integrating KANs into
transformers, however, is no easy feat, especially when scaled up.
Specifically, we identify three key challenges: (C1) Base function. The
standard B-spline function used in KANs is not optimized for parallel computing
on modern hardware, resulting in slower inference speeds. (C2) Parameter and
Computation Inefficiency. KAN requires a unique function for each input-output
pair, making the amount of computation extremely large. (C3) Weight initialization. The
initialization of weights in KANs is particularly challenging due to their
learnable activation functions, which are critical for achieving convergence in
deep neural networks. To overcome the aforementioned challenges, we propose
three key solutions: (S1) Rational basis. We replace B-spline functions with
rational functions to improve compatibility with modern GPUs. By implementing
this in CUDA, we achieve faster computations. (S2) Group KAN. We share
activation weights across a group of neurons to reduce the computational load
without sacrificing performance. (S3) Variance-preserving initialization. We
carefully initialize the activation weights to make sure that the activation
variance is maintained across layers. With these designs, KAT scales
effectively and readily outperforms traditional MLP-based transformers.
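To make the proposed design more concrete, below is a minimal, illustrative sketch of a group-rational KAN-style layer in PyTorch. It is not the authors' implementation: the class name GroupRationalKANLayer, the group count, the polynomial orders, and the initialization scale are assumptions made for this example. The sketch shows how a learnable rational function (S1) with coefficients shared within channel groups (S2) can replace per-edge B-spline activations before a standard linear projection.

# Minimal, illustrative sketch (not the authors' code) of a KAN-style layer that
# applies a learnable rational activation shared within groups of channels,
# followed by a linear projection. Class name, group count, and polynomial
# orders are assumptions for this example.
import torch
import torch.nn as nn


class GroupRationalKANLayer(nn.Module):
    def __init__(self, dim_in, dim_out, num_groups=8, p_order=5, q_order=4):
        super().__init__()
        assert dim_in % num_groups == 0
        self.num_groups = num_groups
        # Numerator and denominator coefficients of the rational function,
        # shared by all channels inside the same group (S2). A variance-preserving
        # scheme (S3) would choose these scales carefully; the small random
        # initialization here is only for illustration.
        self.p_coef = nn.Parameter(torch.randn(num_groups, p_order + 1) * 0.1)
        self.q_coef = nn.Parameter(torch.randn(num_groups, q_order) * 0.1)
        # Channel-mixing projection, as in a standard transformer MLP block.
        self.proj = nn.Linear(dim_in, dim_out)

    def rational(self, x):
        # Apply P(x) / (1 + |Q(x)|) elementwise; the 1 + |.| keeps the
        # denominator away from zero.
        lead_shape = x.shape[:-1]
        x = x.reshape(*lead_shape, self.num_groups, -1)
        p_pows = torch.stack([x ** i for i in range(self.p_coef.shape[1])], dim=-1)
        num = (p_pows * self.p_coef[:, None, :]).sum(-1)
        q_pows = torch.stack([x ** (i + 1) for i in range(self.q_coef.shape[1])], dim=-1)
        den = 1.0 + (q_pows * self.q_coef[:, None, :]).sum(-1).abs()
        return (num / den).reshape(*lead_shape, -1)

    def forward(self, x):
        return self.proj(self.rational(x))


# Hypothetical usage: replace the MLP in a vision-transformer block on 768-dim tokens.
tokens = torch.randn(2, 196, 768)
layer = GroupRationalKANLayer(768, 768, num_groups=8)
out = layer(tokens)  # shape: (2, 196, 768)

Sharing one set of rational coefficients per group, rather than learning a separate spline per input-output pair, is what keeps the parameter count and compute comparable to a plain MLP block while retaining a learnable activation.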