Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
December 8, 2024
Authors: Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) excel in vision--language tasks by
pre-training solely on coarse-grained concept annotations (e.g., image
captions). We hypothesize that integrating fine-grained concept annotations
(e.g., object labels and object regions) will further improve performance, as
both data granularities complement each other in terms of breadth and depth in
concept representation. We introduce a new dataset featuring Multimodal
Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we
explore the impact of different data recipes on multimodal comprehension and
generation. Our analyses reveal that multi-grained concept annotations
integrate and complement each other, under our structured template and a
general MLLM framework. We clearly explore and demonstrate the potential of
MMGiC to help MLLMs better locate and learn concepts, aligning vision and
language at multiple granularities. We further validate our hypothesis by
investigating the fair comparison and effective collaboration between MMGiC and
image--caption data on 12 multimodal comprehension and generation benchmarks,
e.g., their appropriate combination achieves 3.95% and 2.34% absolute
improvements over image--caption data alone on POPE and SEED-Bench. Code, data
and models will be available at https://github.com/LooperXX/MMGiC.
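The abstract describes two annotation granularities that MMGiC unifies under a structured template: coarse-grained image captions and fine-grained object labels with object regions. As a purely illustrative sketch, the Python snippet below shows how one such multi-grained record might be represented and rendered into a single training string; the field names, template wording, and bounding-box format are assumptions for illustration, not the paper's actual MMGiC schema.

```python
# Hypothetical sketch of a multi-grained concept annotation record.
# Field names, template text, and the bounding-box format are assumptions
# for illustration only; they are NOT the actual MMGiC schema.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectAnnotation:
    label: str                          # fine-grained concept label, e.g. "dog"
    region: Tuple[int, int, int, int]   # bounding box (x1, y1, x2, y2) in pixels


@dataclass
class MultiGrainedRecord:
    image_path: str
    caption: str                        # coarse-grained annotation
    objects: List[ObjectAnnotation]     # fine-grained annotations

    def to_template(self) -> str:
        """Render all granularities into one structured training string."""
        object_lines = "\n".join(
            f"- {o.label} at {o.region}" for o in self.objects
        )
        return f"Caption: {self.caption}\nObjects:\n{object_lines}"


record = MultiGrainedRecord(
    image_path="images/000001.jpg",
    caption="A dog chases a ball across a grassy park.",
    objects=[
        ObjectAnnotation("dog", (34, 120, 210, 360)),
        ObjectAnnotation("ball", (250, 300, 290, 340)),
    ],
)
print(record.to_template())
```

Rendering both granularities into one text sequence is one plausible way for a general MLLM framework to align vision and language at multiple granularities, which is the intuition the abstract attributes to its structured template.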