CompCap: Improving Multimodal Large Language Models with Composite Captions

December 6, 2024
作者: Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He
cs.AI

Abstract

How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
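
To make the data format concrete: the paper's contribution is pairing composite images (several visual elements merged into one canvas) with accurate, detailed captions. The sketch below is a minimal, hypothetical illustration of such an image-caption pair, not the authors' CompCap pipeline (which relies on LLMs and automation tools to synthesize CIs and captions); the helper names `make_placeholder`, `compose_grid`, and `compose_caption` are invented for this example.

```python
# Hypothetical sketch of a composite-image + composite-caption pair.
# Placeholder panels stand in for real elements (charts, posters, screenshots).
from PIL import Image, ImageDraw

def make_placeholder(text, size=(256, 256), color=(230, 230, 230)):
    """Stand-in for a single visual element such as a chart or screenshot."""
    img = Image.new("RGB", size, color)
    ImageDraw.Draw(img).text((10, 10), text, fill=(0, 0, 0))
    return img

def compose_grid(elements, cols=2, pad=8):
    """Tile element images into one composite image (CI)."""
    w, h = elements[0].size
    rows = (len(elements) + cols - 1) // cols
    canvas = Image.new("RGB",
                       (cols * w + (cols + 1) * pad, rows * h + (rows + 1) * pad),
                       "white")
    for i, el in enumerate(elements):
        r, c = divmod(i, cols)
        canvas.paste(el, (pad + c * (w + pad), pad + r * (h + pad)))
    return canvas

def compose_caption(element_captions):
    """Merge per-element captions into a single detailed composite caption."""
    parts = [f"Panel {i + 1}: {c}." for i, c in enumerate(element_captions)]
    return f"A composite image with {len(parts)} panels. " + " ".join(parts)

if __name__ == "__main__":
    captions = ["a bar chart of monthly sales",
                "a promotional poster for a concert",
                "a screenshot of a settings menu",
                "a line chart of temperature over time"]
    elements = [make_placeholder(f"panel {i + 1}") for i in range(len(captions))]
    ci = compose_grid(elements)
    ci.save("composite_example.png")
    print(compose_caption(captions))
```

In the paper's setting, such image-caption pairs (118K of them, spanning six CI types) are mixed into supervised fine-tuning data to strengthen vision-language alignment on composite images.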
