Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers

Abstract

Traditional transformer models often allocate a fixed amount of computational resources to every input token, leading to inefficient and unnecessary computation. To address this, Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers. Despite its promise, current MoD approaches remain under-explored and face two main challenges: (1) high training costs, because the entire model must be trained along with the routers that determine which layers to skip, and (2) the risk of performance degradation when important layers are bypassed. To address the first challenge, we propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training. To address the second, we propose MindSkip, which deploys Attention with Dynamic Depths. This method preserves the model's performance while significantly enhancing computational and memory efficiency. Extensive experiments demonstrate that our approach delivers competitive results while dramatically improving computational efficiency, e.g., a 21% speedup with only a 0.2% performance drop. The code is released at https://github.com/CASE-Lab-UMD/Router-Tuning.
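To make the idea of a router-gated layer concrete, the following is a minimal sketch in plain Python and not the released implementation. All names (`router_score`, `dynamic_depth_block`, `toy_attention`) are hypothetical, and the specific choices of mean pooling over tokens, a sigmoid score, a threshold of 0.5, and scaling the sublayer output by the score are assumptions made for illustration. The sublayer is treated as frozen; under a router-only fine-tuning scheme, only `weights` and `bias` would be updated.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def router_score(tokens: Sequence[Vector], weights: Vector, bias: float = 0.0) -> float:
    """Score a whole sequence: mean-pool the tokens, project, apply a sigmoid."""
    dim = len(weights)
    pooled = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def dynamic_depth_block(
    tokens: Sequence[Vector],
    sublayer: Callable[[Sequence[Vector]], List[Vector]],
    weights: Vector,
    bias: float = 0.0,
    threshold: float = 0.5,  # assumed value, not taken from the abstract
) -> Tuple[List[Vector], bool]:
    """Residual block whose sublayer runs only if the router score clears the threshold.

    Returns the block output and a flag saying whether the sublayer was executed.
    """
    score = router_score(tokens, weights, bias)
    if score < threshold:
        # Skipped: only the residual (identity) path is used, so the sublayer costs nothing.
        return [list(tok) for tok in tokens], False
    update = sublayer(tokens)
    # Assumption: the sublayer output is scaled by the router score before the residual add.
    out = [[x + score * u for x, u in zip(tok, upd)] for tok, upd in zip(tokens, update)]
    return out, True


def toy_attention(tokens: Sequence[Vector]) -> List[Vector]:
    """Stand-in for a frozen attention sublayer: every token attends uniformly to all tokens."""
    dim = len(tokens[0])
    mean = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    return [list(mean) for _ in tokens]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    _, ran = dynamic_depth_block(tokens, toy_attention, weights=[4.0, 4.0])
    _, ran_low = dynamic_depth_block(tokens, toy_attention, weights=[-4.0, -4.0])
    print("high-score router executed sublayer:", ran)
    print("low-score router executed sublayer:", ran_low)
```

In this sketch the router is the only component with adjustable parameters, which mirrors the abstract's description of tuning the router while leaving the rest of the model untouched. When the block is skipped, the input passes through unchanged, which is where the compute savings would come from.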

