p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Abstract

Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, their substantial training and inference costs impede their advancement. Most of the computation stems from the overwhelming volume of vision tokens processed by the transformer decoder. In this paper, we propose to build efficient MLLMs by leveraging the Mixture-of-Depths (MoD) mechanism, in which each transformer decoder layer selects the essential vision tokens to process and skips the redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers, and we therefore design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer using a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. To validate the effectiveness of our approach, we conduct extensive experiments with two baseline models across 14 benchmarks. Our model, p-MoD, matches or even surpasses the performance of the baseline models while using only 55.6% of the TFLOPs and 53.8% of the KV cache storage during inference, and 77.7% of the GPU hours during training.
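To make the progressive ratio decay idea concrete, the following is a minimal sketch of a cosine-shaped retention schedule. The abstract only states that the ratio decreases layer by layer following a shifted cosine schedule; the exact formula, the `shift` and `min_ratio` parameters, the clipping, and the 32-layer and 576-token figures used here are illustrative assumptions, not values taken from the paper.

```python
import math


def retention_ratios(num_layers, shift=0.0, min_ratio=0.0):
    """Hypothetical shifted-cosine schedule for token retention ratios.

    Assumption: the base curve is 0.5 * (1 + cos(pi * l / (L - 1))), which
    falls from 1 at the first layer to 0 at the last layer. `shift` moves the
    whole curve up or down, and the result is clipped to [min_ratio, 1].
    """
    ratios = []
    for layer in range(num_layers):
        progress = layer / (num_layers - 1) if num_layers > 1 else 0.0
        base = 0.5 * (1.0 + math.cos(math.pi * progress))
        ratios.append(min(1.0, max(min_ratio, base + shift)))
    return ratios


def tokens_kept(num_vision_tokens, ratios):
    """Number of vision tokens each layer would process under the schedule."""
    return [int(num_vision_tokens * r) for r in ratios]


if __name__ == "__main__":
    # Illustrative sizes only: 32 decoder layers, 576 vision tokens.
    ratios = retention_ratios(num_layers=32, shift=0.0, min_ratio=0.05)
    for layer, (r, k) in enumerate(zip(ratios, tokens_kept(576, ratios))):
        print(f"layer {layer:2d}: ratio={r:.3f}, tokens kept={k}")
```

Under these assumptions, shallow layers process nearly all vision tokens and deep layers process only a small fraction, which matches the observation that redundancy grows with depth. A real MoD layer would additionally need a router that scores tokens and picks which ones to keep; that part is not sketched here.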
