TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
October 30, 2024
Authors: Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele
cs.AI
Abstract
Transformers have become the predominant architecture in foundation models
due to their excellent performance across various domains. However, the
substantial cost of scaling these models remains a significant concern. This
problem arises primarily from their dependence on a fixed number of parameters
within linear projections. When architectural modifications (e.g., channel
dimensions) are introduced, the entire model typically requires retraining from
scratch. As model sizes continue growing, this strategy results in increasingly
high computational costs and becomes unsustainable. To overcome this problem,
we introduce TokenFormer, a natively scalable architecture that leverages the
attention mechanism not only for computations among input tokens but also for
interactions between tokens and model parameters, thereby enhancing
architectural flexibility. By treating model parameters as tokens, we replace
all the linear projections in Transformers with our token-parameter attention
layer, where input tokens act as queries and model parameters as keys and
values. This reformulation allows for progressive and efficient scaling without
necessitating retraining from scratch. Our model scales from 124M to 1.4B
parameters by incrementally adding new key-value parameter pairs, achieving
performance comparable to Transformers trained from scratch while greatly
reducing training costs. Code and models are available at
https://github.com/Haiyang-W/TokenFormer.
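As a rough illustration of the idea in the abstract, the sketch below shows a token-parameter attention layer in which input tokens act as queries and a set of learnable key-value parameter tokens replaces a linear projection, plus a hypothetical `grow` helper that appends new key-value pairs to scale the layer incrementally. This is not the authors' implementation: the class name, the `grow` method, and the use of plain softmax normalization are assumptions made for clarity (the paper uses its own normalization and zero-initializes new pairs to preserve the trained function); see the official repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenParameterAttention(nn.Module):
    """Illustrative sketch of a token-parameter attention layer.

    Input tokens are queries; the layer's weights are stored as learnable
    key/value "parameter tokens". Plain softmax attention is used here,
    whereas the paper's layer uses a different normalization.
    """

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Learnable parameter tokens: keys match the input dimension,
        # values match the output dimension.
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, dim_in) * 0.02)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, dim_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in)
        scores = x @ self.param_keys.t() / self.param_keys.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)   # attend over parameter tokens
        return attn @ self.param_values    # (batch, seq_len, dim_out)

    @torch.no_grad()
    def grow(self, extra_tokens: int) -> None:
        """Hypothetical helper: append new key-value parameter pairs.

        Existing parameters are kept unchanged; new pairs start at zero so
        they contribute nothing through the values. (The paper reports that
        with its normalization this preserves the original function; with
        plain softmax, as used here, preservation is only approximate.)
        """
        device = self.param_keys.device
        new_k = torch.zeros(extra_tokens, self.param_keys.shape[1], device=device)
        new_v = torch.zeros(extra_tokens, self.param_values.shape[1], device=device)
        self.param_keys = nn.Parameter(torch.cat([self.param_keys, new_k], dim=0))
        self.param_values = nn.Parameter(torch.cat([self.param_values, new_v], dim=0))


# Usage: a layer with 1024 parameter tokens, later widened by 512 more
# without discarding what has already been learned.
layer = TokenParameterAttention(dim_in=768, dim_out=768, num_param_tokens=1024)
x = torch.randn(2, 16, 768)
y_small = layer(x)
layer.grow(512)      # incremental scaling instead of retraining from scratch
y_large = layer(x)   # same interface, larger parameter count
```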