Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

November 16, 2024
Authors: Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu
cs.AI

Abstract

As research on Multimodal Large Language Models (MLLMs) becomes popular, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing the data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across various tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple latest benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released on our Project Page: https://github.com/MetabrainAGI/Awaker.

Summary

AI-Generated Summary

Paper Overview

This paper introduces Awaker2.5-VL, an architecture based on a Mixture of Experts (MoE) for Multimodal Large Language Models (MLLM). Awaker2.5-VL excels in various benchmarks, showcasing superior performance in perception and overall scores compared to state-of-the-art models.

Core Contribution

Awaker2.5-VL, built on an MoE approach, delivers strong results across various benchmarks, particularly in perception and overall scores, surpassing existing models, and effectively addresses the "multi-task conflict" issue in Multimodal Large Language Models.

Research Context

The research focuses on advancing Multimodal Large Language Models (MLLM) by introducing the Awaker2.5-VL architecture, emphasizing the use of a Mixture of Experts (MoE) for improved performance in visual and textual tasks simultaneously.

Keywords

  • Mixture of Experts (MoE)
  • Multimodal Large Language Models (MLLM)
  • Benchmarking (MME-RealWorld, MMBench)
  • Perception and Reasoning Scores
  • Expert Activation and Routing

Background

The transition from Large Language Models (LLM) to Multimodal Large Language Models (MLLM) aims to handle both textual and visual tasks concurrently. However, simple data fusion in MLLMs can lead to multi-task conflicts, necessitating innovative architectures like Awaker2.5-VL.

Research Gap

Existing MLLMs face challenges with multi-task conflicts due to straightforward data integration, highlighting the need for specialized architectures like Awaker2.5-VL with a Mixture of Experts approach.

Technical Challenges

The "multi-task conflict" issue in MLLMs poses a significant technical obstacle, requiring novel architectures like Awaker2.5-VL to efficiently handle diverse tasks without interference.

Prior Approaches

Previous MLLM approaches, which simply mixed training data across tasks, lacked a mechanism to resolve multi-task conflicts, underscoring the need for models like Awaker2.5-VL with its Mixture of Experts design.

Methodology

Awaker2.5-VL is built on a Mixture of Experts (MoE) framework, enhancing performance in various benchmarks through expert activation and routing strategies, such as LoRA adaptation structures.
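The LoRA expert structure mentioned above can be sketched as follows. This is a minimal NumPy illustration of a standard LoRA update added on top of a frozen base weight; the class name, shapes, and hyperparameters are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

# Minimal sketch of a LoRA-style expert: the frozen base weight W is shared,
# and each expert stores only the low-rank factors A and B, so adding experts
# is parameter-efficient. Names and shapes are illustrative assumptions.
class LoRAExpert:
    def __init__(self, d_in, d_out, rank, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.02, size=(rank, d_in))  # down-projection
        self.B = np.zeros((d_out, rank))                   # up-projection, zero-init
        self.scale = alpha / rank

    def delta(self, x):
        # Low-rank correction added to the frozen base layer's output.
        return self.scale * (self.B @ (self.A @ x))

d_in = d_out = 64
W = np.random.default_rng(1).normal(0.0, 0.02, size=(d_out, d_in))  # frozen base
expert = LoRAExpert(d_in, d_out, rank=8)
x = np.ones(d_in)
y = W @ x + expert.delta(x)  # with B zero-initialized, delta(x) starts at 0
```

Because `B` is zero-initialized, a freshly added expert leaves the base model's behavior unchanged, which is the usual reason LoRA training starts stably.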

Theoretical Foundation

Awaker2.5-VL's architecture is grounded in the Mixture of Experts (MoE) concept, where sparse expert activation for each task is crucial for efficient training and inference.

Technical Architecture

Awaker2.5-VL employs a Mixture of Experts (MoE) structure operating at the instance level, utilizing stable routing strategies to optimize task performance.
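As a rough illustration of instance-level routing (the gating details here are assumptions, not the paper's exact router), the gate can score the experts once per input instance, so every token of that instance shares the same sparsely activated experts:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical instance-level router: one gating decision per input instance
# (rather than per token), sparsely selecting the top-k experts.
def route_instance(instance_repr, gate_W, top_k=1):
    logits = gate_W @ instance_repr            # one logit per expert
    probs = softmax(logits)
    chosen = np.argsort(probs)[::-1][:top_k]   # indices of the top-k experts
    return chosen, probs

rng = np.random.default_rng(0)
num_experts, d = 4, 16
gate_W = rng.normal(size=(num_experts, d))     # learned gate weights (random here)
instance = rng.normal(size=d)                  # pooled representation of one instance
chosen, probs = route_instance(instance, gate_W, top_k=1)
```

Routing once per instance rather than per token is one way to make expert assignment more stable, since all tokens of an input are processed by the same experts.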

Implementation Details

The training process of Awaker2.5-VL involves three stages: Initialization, MoE Training, and Instruction Adjustment, with each expert designed as a Low-rank Adaptation structure for enhanced efficiency.
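The three-stage schedule can be expressed as a configuration like the following sketch. The stage names follow the summary above, but which parameter groups are trainable versus frozen at each stage is an assumption for illustration, not the paper's exact recipe:

```python
# Hypothetical sketch of the three-stage training schedule; the
# trainable/frozen parameter groups are illustrative assumptions.
STAGES = {
    "initialization":         {"trainable": ["single_lora"],
                               "frozen": ["base_mllm"]},
    "moe_training":           {"trainable": ["lora_experts", "router"],
                               "frozen": ["base_mllm"]},
    "instruction_adjustment": {"trainable": ["lora_experts", "router"],
                               "frozen": ["base_mllm"]},
}

def trainable_groups(stage):
    """Return which parameter groups receive gradients in a given stage."""
    return STAGES[stage]["trainable"]

# The base MLLM stays frozen throughout; only the lightweight LoRA experts
# and the router are ever updated, which keeps every stage cheap.
```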

Innovation Points

Awaker2.5-VL introduces a novel Mixture of Experts (MoE) architecture with LoRA structures, showcasing superior performance in various benchmarks compared to existing models.

Experimental Validation

Awaker2.5-VL's performance is validated through experiments across multiple benchmarks, demonstrating its superiority in perception and overall scores.

Setup

Awaker2.5-VL is validated on benchmarks including MME-RealWorld, MMBench-CN, and MME-RealWorld-CN, where it achieves strong perception and overall scores.

Metrics

Performance metrics such as perception and reasoning scores are used to evaluate Awaker2.5-VL's effectiveness in handling diverse tasks within Multimodal Large Language Models.

Results

Awaker2.5-VL outperforms competitors in various benchmarks, maintaining its lead in perception and overall scores, despite a slight decrease in reasoning compared to the state-of-the-art.

Comparative Analysis

Awaker2.5-VL's architecture, particularly its Mixture of Experts (MoE) design, surpasses competitors on benchmarks like MME-RealWorld and MMBench, highlighting its technical advances.

Impact and Implications

Awaker2.5-VL's innovative approach has significant implications for the field of Multimodal Large Language Models, paving the way for improved task handling and performance enhancements.

Key Findings

Awaker2.5-VL demonstrates exceptional performance in perception and overall scores across various benchmarks, showcasing the effectiveness of its Mixture of Experts (MoE) architecture.

Limitations

While Awaker2.5-VL excels in perception and overall scores, there is a slight decrease in reasoning compared to the state-of-the-art, indicating areas for further improvement.

Future Directions

Future research directions include enhancing query representations for improved routing performance and extending the MoE model to the ViT part of the multimodal model, presenting concrete opportunities for advancement.

Practical Significance

Awaker2.5-VL's advancements have practical implications for real-world applications, offering improved performance in handling diverse textual and visual tasks simultaneously.
