Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
October 17, 2024
作者: Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, Ang Li, Dong Yu
cs.AI
Abstract
Traditional transformer models often allocate a fixed amount of computational
resources to every input token, leading to inefficient and unnecessary
computation. To address this, the Mixture of Depths (MoD) was introduced to
dynamically adjust the computational depth by skipping less important layers.
Despite its promise, current MoD approaches remain under-explored and face two
main challenges: (1) high training costs due to the need to train the
entire model along with the routers that determine which layers to skip, and
(2) the risk of performance degradation when important layers are
bypassed. In response to the first issue, we propose Router-Tuning, a method
that fine-tunes only the router on a small dataset, drastically reducing the
computational overhead associated with full model training. For the second
challenge, we propose MindSkip, which deploys Attention with Dynamic
Depths. This method preserves the model's performance while significantly
enhancing computational and memory efficiency. Extensive experiments
demonstrate that our approach delivers competitive results while dramatically
improving computational efficiency, e.g., a 21% speedup with only a 0.2%
performance drop. The code is released at
https://github.com/CASE-Lab-UMD/Router-Tuning.
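To make the idea concrete, the sketch below shows a minimal PyTorch attention block gated by a lightweight router, in the spirit of the abstract's description: a small router scores the input and decides whether the attention sub-block runs, and only the router's parameters are trained. This is an illustrative assumption, not the authors' implementation; the class name `RouterGatedAttention`, the mean-pooled scoring, and the 0.5 threshold are made up for this example.

```python
# Hypothetical sketch (not the released Router-Tuning code): a tiny router gates
# whether the attention sub-block runs; the backbone weights stay frozen and
# only the router is fine-tuned ("router-tuning").
import torch
import torch.nn as nn

class RouterGatedAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, threshold: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.router = nn.Linear(hidden_size, 1)  # lightweight router: one scalar score
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Score the (mean-pooled) input to decide how important attention is here.
        score = torch.sigmoid(self.router(x.mean(dim=1)))  # (batch, 1)
        if self.training:
            # Soft gate keeps the router differentiable during router-tuning.
            attn_out, _ = self.attn(x, x, x)
            gate = score.unsqueeze(-1)  # (batch, 1, 1), broadcasts over seq and hidden
            return x + gate * attn_out
        # At inference, skip the attention computation entirely when the score is low.
        if score.mean() < self.threshold:
            return x  # residual pass-through, no attention compute
        attn_out, _ = self.attn(x, x, x)
        return x + attn_out

# Router-tuning usage: freeze everything except the router before fine-tuning.
block = RouterGatedAttention(hidden_size=768, num_heads=12)
for name, param in block.named_parameters():
    param.requires_grad = "router" in name
```

The gating here is per sequence for simplicity; the paper's method operates on attention with dynamic depth inside a full Transformer, where skipping attention also avoids the associated KV-cache cost, which is where the reported memory savings would come from.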