γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
October 17, 2024
Authors: Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji
cs.AI
Abstract
Despite the significant progress in multimodal large language models (MLLMs),
their high computational cost remains a barrier to real-world deployment.
Inspired by the mixture of depths (MoDs) in natural language processing, we aim
to address this limitation from the perspective of "activated tokens". Our
key insight is that if most tokens are redundant for the layer computation,
then they can be skipped directly via the MoD layer. However, directly converting
the dense layers of MLLMs to MoD layers leads to substantial performance
degradation. To address this issue, we propose an innovative MoD adaptation
strategy for existing MLLMs called γ-MoD. In γ-MoD, a novel metric, namely the
rank of attention maps (ARank), is proposed to guide the deployment of MoDs in
the MLLM. Through ARank, we can effectively identify which layers are redundant
and should be replaced with MoD layers. Based on ARank, we
further propose two novel designs to maximize the computational sparsity of
MLLM while maintaining its performance, namely shared vision-language router
and masked routing learning. With these designs, more than 90% of the MLLM's
dense layers can be effectively converted to MoD layers. To validate our method,
we apply it to three popular MLLMs and conduct extensive experiments on 9
benchmark datasets. Experimental results not only validate the significant
efficiency benefit of γ-MoD for existing MLLMs but also confirm its
generalization ability across various MLLMs. For example, with a minor performance
drop, i.e., -1.5%, γ-MoD can reduce the training and inference time of
LLaVA-HR by 31.0% and 53.2%, respectively.
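
The abstract names ARank, the shared vision-language router, and masked routing learning without giving their exact formulations. The PyTorch sketch below is only a rough illustration of the general idea, assuming ARank can be approximated by the mean numerical rank of a layer's per-head attention maps and that a single shared linear router keeps a fixed ratio of tokens per converted layer; the names `attention_map_rank`, `SharedRouterMoDLayer`, and `keep_ratio` are hypothetical, not the paper's API.

```python
# Illustrative sketch only; thresholds, names, and routing details are assumptions.
import torch
import torch.nn as nn


def attention_map_rank(attn: torch.Tensor, atol: float = 1e-3) -> float:
    """Approximate ARank: mean numerical rank of per-head attention maps.

    attn has shape (num_heads, seq_len, seq_len). A low value suggests the
    layer's token computation is largely redundant and the layer is a
    candidate for conversion to a MoD layer."""
    ranks = torch.linalg.matrix_rank(attn, atol=atol)  # one rank per head
    return ranks.float().mean().item()


class SharedRouterMoDLayer(nn.Module):
    """A dense block converted to a MoD-style layer.

    A linear router (shared across converted layers) scores all vision and
    text tokens; only the top-k tokens are processed by the wrapped block,
    and the remaining tokens are passed through unchanged."""

    def __init__(self, block: nn.Module, router: nn.Linear, keep_ratio: float = 0.5):
        super().__init__()
        self.block = block            # the original dense transformer block
        self.router = router          # shared vision-language router (one score per token)
        self.keep_ratio = keep_ratio  # fraction of tokens that stay activated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        scores = self.router(x).squeeze(-1)          # (batch, seq_len)
        k = max(1, int(x.shape[1] * self.keep_ratio))
        topk = scores.topk(k, dim=1).indices         # indices of activated tokens
        out = x.clone()
        for b in range(x.shape[0]):
            sel = topk[b]
            # Only the selected tokens go through the dense block.
            out[b, sel] = self.block(x[b:b + 1, sel]).squeeze(0)
        return out
```

In such a setup, layers whose measured ARank falls below a chosen threshold would be wrapped with `SharedRouterMoDLayer`, while high-rank layers stay dense; the routing scores could additionally be supervised during training, for instance by masking losses on skipped tokens, which is presumably the role of the masked routing learning described above.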