Kolmogorov-Arnold Transformer

Abstract

Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and computation inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely expensive. (C3) Weight initialization. Initializing the weights in KANs is particularly challenging because of their learnable activation functions, and a good initialization is critical for achieving convergence in deep neural networks. To overcome these challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights across a group of neurons to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights so that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.
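To make (S1) and (S2) concrete, the following is a minimal pure-Python sketch of a group-wise rational activation. It is an illustration under stated assumptions, not the paper's implementation: the abstract only says "rational functions", so the specific form P(x) / (1 + |Q(x)|), which keeps the denominator at or above 1, is an assumption here. The function names `rational` and `group_rational` and the coefficient layout are hypothetical.

```python
from typing import List, Sequence


def rational(x: float, a: Sequence[float], b: Sequence[float]) -> float:
    """Evaluate P(x) / (1 + |Q(x)|).

    Assumed form (not stated in the abstract):
      P(x) = a[0] + a[1]*x + a[2]*x^2 + ...
      Q(x) = b[0]*x + b[1]*x^2 + ...   (no constant term)
    The denominator is always >= 1, so there are no poles.
    """
    # Horner's rule for the numerator polynomial P(x).
    p = 0.0
    for coef in reversed(a):
        p = p * x + coef
    # Horner's rule for Q(x) / x, then multiply by x once.
    inner = 0.0
    for coef in reversed(b):
        inner = inner * x + coef
    q = inner * x
    return p / (1.0 + abs(q))


def group_rational(
    channels: Sequence[float],
    a_groups: Sequence[Sequence[float]],
    b_groups: Sequence[Sequence[float]],
) -> List[float]:
    """Apply one shared rational function per contiguous group of channels."""
    n_groups = len(a_groups)
    if len(b_groups) != n_groups:
        raise ValueError("a_groups and b_groups must have the same length")
    if len(channels) % n_groups != 0:
        raise ValueError("number of channels must be divisible by number of groups")
    size = len(channels) // n_groups  # channels per group
    return [
        rational(x, a_groups[i // size], b_groups[i // size])
        for i, x in enumerate(channels)
    ]


# Four channels, two groups.
# Group 0: P(x) = x, Q(x) = 0  -> identity.
# Group 1: P(x) = x, Q(x) = x  -> x / (1 + |x|).
out = group_rational(
    [1.0, -3.0, 1.0, -3.0],
    a_groups=[[0.0, 1.0], [0.0, 1.0]],
    b_groups=[[0.0], [1.0]],
)
print(out)  # [1.0, -3.0, 0.5, -0.75]
```

In this sketch the number of learnable activation coefficients grows with the number of groups rather than with the number of input-output pairs, which is the saving that (S2) describes. A rational function also needs only multiplications, additions, and one division per element, which is plausibly why it maps better to GPUs than B-spline evaluation; the CUDA kernel mentioned in the abstract is not reproduced here.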
