Kolmogorov-Arnold Transformer
September 16, 2024
Authors: Xingyi Yang, Xinchao Wang
cs.AI
Abstract
Transformers stand as the cornerstone of modern deep learning.
Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix
the information between channels. In this paper, we introduce the
Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP
layers with Kolmogorov-Arnold Network (KAN) layers to enhance the
expressiveness and performance of the model. Integrating KANs into
transformers, however, is no easy feat, especially when scaled up.
Specifically, we identify three key challenges: (C1) Base function. The
standard B-spline function used in KANs is not optimized for parallel computing
on modern hardware, resulting in slower inference speeds. (C2) Parameter and
Computation Inefficiency. KAN requires a unique function for each input-output
pair, making the amount of computation extremely large. (C3) Weight initialization. The
initialization of weights in KANs is particularly challenging due to their
learnable activation functions, which are critical for achieving convergence in
deep neural networks. To overcome the aforementioned challenges, we propose
three key solutions: (S1) Rational basis. We replace B-spline functions with
rational functions to improve compatibility with modern GPUs. By implementing
this in CUDA, we achieve faster computations. (S2) Group KAN. We share
activation weights across a group of neurons to reduce the computational load
without sacrificing performance. (S3) Variance-preserving initialization. We
carefully initialize the activation weights to make sure that the activation
variance is maintained across layers. With these designs, KAT scales
effectively and readily outperforms traditional MLP-based transformers.
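To make the proposed design more concrete, below is a minimal, illustrative sketch of a group-rational KAN-style layer in PyTorch. It is not the authors' implementation: the class name GroupRationalKANLayer, the group count, the polynomial orders, and the initialization scale are assumptions made for this example. The sketch shows how a learnable rational function (S1) with coefficients shared within channel groups (S2) can replace per-edge B-spline activations before a standard linear projection.

# Minimal, illustrative sketch (not the authors' code) of a KAN-style layer that
# applies a learnable rational activation shared within groups of channels,
# followed by a linear projection. Class name, group count, and polynomial
# orders are assumptions for this example.
import torch
import torch.nn as nn


class GroupRationalKANLayer(nn.Module):
    def __init__(self, dim_in, dim_out, num_groups=8, p_order=5, q_order=4):
        super().__init__()
        assert dim_in % num_groups == 0
        self.num_groups = num_groups
        # Numerator and denominator coefficients of the rational function,
        # shared by all channels inside the same group (S2). A variance-preserving
        # scheme (S3) would choose these scales carefully; the small random
        # initialization here is only for illustration.
        self.p_coef = nn.Parameter(torch.randn(num_groups, p_order + 1) * 0.1)
        self.q_coef = nn.Parameter(torch.randn(num_groups, q_order) * 0.1)
        # Channel-mixing projection, as in a standard transformer MLP block.
        self.proj = nn.Linear(dim_in, dim_out)

    def rational(self, x):
        # Apply P(x) / (1 + |Q(x)|) elementwise; the 1 + |.| keeps the
        # denominator away from zero.
        lead_shape = x.shape[:-1]
        x = x.reshape(*lead_shape, self.num_groups, -1)
        p_pows = torch.stack([x ** i for i in range(self.p_coef.shape[1])], dim=-1)
        num = (p_pows * self.p_coef[:, None, :]).sum(-1)
        q_pows = torch.stack([x ** (i + 1) for i in range(self.q_coef.shape[1])], dim=-1)
        den = 1.0 + (q_pows * self.q_coef[:, None, :]).sum(-1).abs()
        return (num / den).reshape(*lead_shape, -1)

    def forward(self, x):
        return self.proj(self.rational(x))


# Hypothetical usage: replace the MLP in a vision-transformer block on 768-dim tokens.
tokens = torch.randn(2, 196, 768)
layer = GroupRationalKANLayer(768, 768, num_groups=8)
out = layer(tokens)  # shape: (2, 196, 768)

Sharing one set of rational coefficients per group, rather than learning a separate spline per input-output pair, is what keeps the parameter count and compute comparable to a plain MLP block while retaining a learnable activation.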