CompCap: Improving Multimodal Large Language Models with Composite Captions
December 6, 2024
Authors: Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He
cs.AI
Abstract
How well can Multimodal Large Language Models (MLLMs) understand composite
images? Composite images (CIs) are synthetic visuals created by merging
multiple visual elements, such as charts, posters, or screenshots, rather than
being captured directly by a camera. While CIs are prevalent in real-world
applications, recent MLLM developments have primarily focused on interpreting
natural images (NIs). Our research reveals that current MLLMs face significant
challenges in accurately understanding CIs, often struggling to extract
information or perform complex reasoning based on these images. We find that
existing training data for CIs are mostly formatted for question-answer tasks
(e.g., in datasets like ChartQA and ScienceQA), while high-quality
image-caption datasets, critical for robust vision-language alignment, are only
available for NIs. To bridge this gap, we introduce Composite Captions
(CompCap), a flexible framework that leverages Large Language Models (LLMs) and
automation tools to synthesize CIs with accurate and detailed captions. Using
CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs
across six CI types. We validate the effectiveness of CompCap-118K by
supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and
LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K
significantly enhances MLLMs' understanding of CIs, yielding average gains of
1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
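Below is a minimal, illustrative sketch of the general idea behind synthesizing a composite image paired with a merged caption: several captioned natural images are pasted onto a grid canvas, and a prompt is assembled that asks an LLM for one detailed caption covering the layout and each element. This is not the authors' CompCap pipeline; the function names, file names, and prompt wording are hypothetical, and a real pipeline would cover additional CI types (charts, posters, screenshots) and richer layout metadata.

```python
# Illustrative sketch only -- not the CompCap implementation.
# Composes a simple grid-collage composite image (CI) from natural images and
# builds an LLM prompt that merges per-image captions into one composite caption.
from PIL import Image

def make_collage(image_paths, cols=2, cell_size=(256, 256)):
    """Paste the input images onto a grid canvas to form a composite image."""
    rows = (len(image_paths) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * cell_size[0], rows * cell_size[1]), "white")
    for i, path in enumerate(image_paths):
        img = Image.open(path).convert("RGB").resize(cell_size)
        x = (i % cols) * cell_size[0]
        y = (i // cols) * cell_size[1]
        canvas.paste(img, (x, y))
    return canvas

def build_caption_prompt(per_image_captions, layout="2-column grid"):
    """Build a prompt asking an LLM to merge element captions into a CI caption."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(per_image_captions))
    return (
        f"The following images are arranged in a {layout}, numbered left-to-right, "
        f"top-to-bottom:\n{numbered}\n"
        "Write a single detailed caption describing the composite image, "
        "including its layout and the content of each element."
    )

if __name__ == "__main__":
    # Hypothetical input files and captions, used only to illustrate the flow.
    paths = ["dog.jpg", "beach.jpg", "city.jpg", "mountains.jpg"]
    captions = [
        "A golden retriever running on grass.",
        "A sandy beach at sunset.",
        "A busy city street at night.",
        "Snow-capped mountains under a clear sky.",
    ]
    make_collage(paths).save("composite.jpg")
    print(build_caption_prompt(captions))
```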