Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
November 16, 2024
作者: Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu
cs.AI
Abstract
As research on Multimodal Large Language Models (MLLMs) becomes popular,
an advanced MLLM is typically required to handle various textual and
visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for
real-world applications. However, due to the significant differences in
representation and distribution among data from various tasks, simply mixing
the data of all tasks together leads to the well-known "multi-task conflict" issue,
resulting in performance degradation across various tasks. To address this
issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture
suitable for MLLMs, which acquires multi-task capabilities through multiple
sparsely activated experts. To speed up the training and inference of
Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation
(LoRA) structure. Extensive experiments on multiple recent benchmarks
demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are
released on our project page: https://github.com/MetabrainAGI/Awaker.
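To make the described architecture concrete, below is a minimal PyTorch sketch of a mixture-of-LoRA-experts layer: a frozen base linear projection is augmented with several LoRA experts, and a lightweight router sparsely activates the top-k of them per token. The class names (`LoRAExpert`, `MoELoRALayer`) and hyperparameters (`rank`, `num_experts`, `top_k`) are illustrative assumptions for exposition, not the released Awaker2.5-VL implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One expert implemented as a low-rank adaptation (LoRA) delta for a frozen base projection."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)   # low-rank down-projection (A)
        self.up = nn.Linear(rank, out_dim, bias=False)    # low-rank up-projection (B)
        self.scale = alpha / rank
        nn.init.zeros_(self.up.weight)                    # delta starts at zero, preserving the base model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale


class MoELoRALayer(nn.Module):
    """Hypothetical sketch: a frozen base linear layer plus a sparsely activated mixture of LoRA experts."""

    def __init__(self, base_layer: nn.Linear, num_experts: int = 4, top_k: int = 1, rank: int = 8):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():                  # only the router and LoRA experts are trainable
            p.requires_grad = False
        in_dim, out_dim = base_layer.in_features, base_layer.out_features
        self.experts = nn.ModuleList(
            [LoRAExpert(in_dim, out_dim, rank) for _ in range(num_experts)]
        )
        self.router = nn.Linear(in_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route each token to its top-k experts and add their weighted LoRA deltas to the base output.
        gate_logits = self.router(x)                               # (..., num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)        # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = self.base(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1).to(x.dtype)  # tokens routed to expert e in slot k
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out


# Usage: wrap a projection of a pretrained model block with the MoE-of-LoRA layer.
base = nn.Linear(1024, 1024)
layer = MoELoRALayer(base, num_experts=4, top_k=1, rank=8)
y = layer(torch.randn(2, 16, 1024))                                # (batch, tokens, hidden)
```

Because only the router and the low-rank matrices are trainable while the base projection stays frozen, the number of updated parameters remains small, which is what makes such a MoE-of-LoRA design parameter-efficient.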