Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
December 8, 2024
Authors: Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) excel in vision--language tasks by
pre-training solely on coarse-grained concept annotations (e.g., image
captions). We hypothesize that integrating fine-grained concept annotations
(e.g., object labels and object regions) will further improve performance, as
both data granularities complement each other in terms of breadth and depth in
concept representation. We introduce a new dataset featuring Multimodal
Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we
explore the impact of different data recipes on multimodal comprehension and
generation. Our analyses reveal that multi-grained concept annotations
integrate and complement each other, under our structured template and a
general MLLM framework. We clearly explore and demonstrate the potential of
MMGiC to help MLLMs better locate and learn concepts, aligning vision and
language at multiple granularities. We further validate our hypothesis by
investigating the fair comparison and effective collaboration between MMGiC and
image--caption data on 12 multimodal comprehension and generation benchmarks,
e.g., their appropriate combination achieves 3.95% and 2.34% absolute
improvements over image--caption data alone on POPE and SEED-Bench. Code, data
and models will be available at https://github.com/LooperXX/MMGiC.
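The abstract describes two annotation granularities that MMGiC unifies under a structured template: coarse-grained image captions and fine-grained object labels with object regions. As a purely illustrative sketch, the Python snippet below shows how one such multi-grained record might be represented and rendered into a single training string; the field names, template wording, and bounding-box format are assumptions for illustration, not the paper's actual MMGiC schema.

```python
# Hypothetical sketch of a multi-grained concept annotation record.
# Field names, template text, and the bounding-box format are assumptions
# for illustration only; they are NOT the actual MMGiC schema.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectAnnotation:
    label: str                          # fine-grained concept label, e.g. "dog"
    region: Tuple[int, int, int, int]   # bounding box (x1, y1, x2, y2) in pixels


@dataclass
class MultiGrainedRecord:
    image_path: str
    caption: str                        # coarse-grained annotation
    objects: List[ObjectAnnotation]     # fine-grained annotations

    def to_template(self) -> str:
        """Render all granularities into one structured training string."""
        object_lines = "\n".join(
            f"- {o.label} at {o.region}" for o in self.objects
        )
        return f"Caption: {self.caption}\nObjects:\n{object_lines}"


record = MultiGrainedRecord(
    image_path="images/000001.jpg",
    caption="A dog chases a ball across a grassy park.",
    objects=[
        ObjectAnnotation("dog", (34, 120, 210, 360)),
        ObjectAnnotation("ball", (250, 300, 290, 340)),
    ],
)
print(record.to_template())
```

Rendering both granularities into one text sequence is one plausible way for a general MLLM framework to align vision and language at multiple granularities, which is the intuition the abstract attributes to its structured template.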