Chimera: Improving Generalist Model with Domain-Specific Experts
December 8, 2024
Authors: Tianshuo Peng, Mingsheng Li, Hongbin Zhou, Renqiu Xia, Renrui Zhang, Lei Bai, Song Mao, Bin Wang, Conghui He, Aojun Zhou, Botian Shi, Tao Chen, Bo Zhang, Xiangyu Yue
cs.AI
Abstract
Recent advancements in Large Multi-modal Models (LMMs) underscore the
importance of scaling by increasing image-text paired data, achieving
impressive performance on general tasks. Despite their effectiveness in broad
applications, generalist models are primarily trained on web-scale datasets
dominated by natural images, resulting in the sacrifice of specialized
capabilities for domain-specific tasks that require extensive domain prior
knowledge. Moreover, directly integrating expert models tailored for specific
domains is challenging due to the representational gap and imbalanced
optimization between the generalist model and experts. To address these
challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline
designed to boost the ability of existing LMMs with domain-specific experts.
Specifically, we design a progressive training strategy to integrate features
from expert models into the input of a generalist LMM. To address the
imbalanced optimization caused by the well-aligned general visual encoder, we
introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism.
This results in a versatile model that excels across the chart, table, math,
and document domains, achieving state-of-the-art performance on multi-modal
reasoning and visual content extraction tasks, both of which are challenging
tasks for assessing existing LMMs.
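The abstract names two mechanisms without detailing them: expert features are integrated into the generalist LMM's input, and the GSCM mechanism counteracts the imbalance caused by the well-aligned general visual encoder. The following is a minimal Python sketch of one possible reading, not the authors' implementation. It assumes that expert features are mapped into the LMM embedding space by a bias-free linear projection and concatenated after the generalist visual tokens and before the text tokens, and it models the masking as randomly dropping a fraction of generalist tokens during training only. The names `project`, `mask_generalist`, `build_input`, and `mask_ratio` are hypothetical; the abstract does not say how GSCM selects or masks tokens.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]  # one token embedding, stored as a plain list


def project(tokens: Sequence[Vector], weight: Sequence[Sequence[float]]) -> List[Vector]:
    """Hypothetical bias-free linear map from expert space (d_in) to LMM space (d_out).

    `weight` is a d_out x d_in matrix given as a list of rows.
    """
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def mask_generalist(tokens: Sequence[Vector], mask_ratio: float,
                    rng: random.Random) -> List[Vector]:
    """Hypothetical masking: drop int(mask_ratio * n) generalist tokens chosen
    uniformly at random, keeping the survivors in their original order."""
    n = len(tokens)
    dropped = set(rng.sample(range(n), int(mask_ratio * n)))
    return [tok for i, tok in enumerate(tokens) if i not in dropped]


def build_input(generalist: Sequence[Vector], expert: Sequence[Vector],
                text: Sequence[Vector], expert_proj: Sequence[Sequence[float]],
                mask_ratio: float = 0.0, training: bool = False,
                rng: Optional[random.Random] = None) -> List[Vector]:
    """Assemble the assumed LMM input: [generalist visual | projected expert | text]."""
    rng = rng if rng is not None else random.Random()
    gen = list(generalist)
    if training and mask_ratio > 0.0:
        # Masking only during training; at inference every generalist token is kept.
        gen = mask_generalist(gen, mask_ratio, rng)
    expert_tokens = project(expert, expert_proj)
    return gen + expert_tokens + list(text)


gen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 generalist tokens, d=2
exp = [[2.0, 3.0, 4.0]]                                  # 1 expert token, d=3
txt = [[9.0, 9.0]]                                       # 1 text token, d=2
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]                   # 2 x 3 projection

train_seq = build_input(gen, exp, txt, W, mask_ratio=0.5, training=True,
                        rng=random.Random(0))
infer_seq = build_input(gen, exp, txt, W)
print(len(train_seq), len(infer_seq))  # 4 6
print(infer_seq[4])                    # [2.0, 7.0]
```

Dropping generalist tokens only during training is a design choice of this sketch: on masked steps the model has fewer general-encoder tokens to lean on, so more of the signal must come through the expert pathway, while inference still sees every token. The progressive, multi-stage training schedule mentioned in the abstract is not modeled here.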