Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Abstract

As research on Multimodal Large Language Models (MLLMs) gains popularity, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, because data from different tasks differ significantly in representation and distribution, simply mixing the data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple recent benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released on our project page: https://github.com/MetabrainAGI/Awaker.
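The idea of sparsely activated LoRA experts can be illustrated with a small, dependency-free sketch. It is not the released Awaker2.5-VL implementation: the class names (`LoRAExpert`, `MoELoRALinear`), the linear gate, the top-k routing rule, and all sizes are assumptions chosen for illustration. The sketch shows a frozen base weight whose output is adjusted by a weighted sum of the low-rank updates from only the selected experts.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the scale is an arbitrary illustrative choice."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LoRAExpert:
    """Low-rank update delta(x) = B (A x), with rank much smaller than d_in/d_out."""

    def __init__(self, d_in, d_out, rank, rng):
        self.A = rand_matrix(rank, d_in, rng)
        # B starts at zero (common LoRA practice), so the update is initially a no-op.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class MoELoRALinear:
    """Hypothetical layer: frozen base weight plus top-k gated LoRA experts."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, top_k=2, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)           # frozen base weight
        self.gate = rand_matrix(num_experts, d_in, rng)  # assumed linear router
        self.experts = [LoRAExpert(d_in, d_out, rank, rng)
                        for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k scoring experts."""
        scores = matvec(self.gate, x)
        top = sorted(range(len(scores)), key=lambda i: scores[i],
                     reverse=True)[:self.top_k]
        weights = softmax([scores[i] for i in top])
        return list(zip(top, weights))

    def __call__(self, x):
        out = matvec(self.W, x)
        # Only the selected experts are evaluated (sparse activation).
        for idx, weight in self.route(x):
            delta = self.experts[idx](x)
            out = [o + weight * d for o, d in zip(out, delta)]
        return out


layer = MoELoRALinear(d_in=8, d_out=4, num_experts=4, rank=2, top_k=2)
x = [0.5] * 8
routed = layer.route(x)
y = layer(x)

# Parameter count of one LoRA expert versus a full d_out x d_in expert.
lora_params = 2 * (8 + 4)
full_params = 8 * 4
```

With these toy sizes each LoRA expert holds `rank * (d_in + d_out)` parameters instead of `d_in * d_out`; the saving becomes substantial at realistic hidden sizes, which is the kind of efficiency the abstract attributes to the LoRA expert design.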
