Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
January 10, 2025
Authors: You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun
cs.AI
Abstract
The recent advancement of Multimodal Large Language Models (MLLMs) has
significantly improved their fine-grained perception of single images and
general comprehension across multiple images. However, existing MLLMs still
face challenges in achieving precise grounding in complex multi-image
scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework
that integrates single-image grounding with multi-image comprehension. While
partially effective, it remains unstable and struggles to capture abstract
visual information due to its non-end-to-end nature. Therefore, we introduce
Migician, the first multi-image grounding model capable of performing free-form
and accurate grounding across multiple images. To support this, we present the
MGrounding-630k dataset, which comprises data for several multi-image grounding
tasks derived from existing datasets, along with newly generated free-form
grounding instruction-following data. Furthermore, we propose MIG-Bench, a
comprehensive benchmark specifically designed for evaluating multi-image
grounding capabilities. Experimental results demonstrate that our model
achieves significantly superior multi-image grounding capabilities,
outperforming the best existing MLLMs by 21.61% and even surpassing much larger
70B models. Our code, model, dataset, and benchmark are fully open-sourced.
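The abstract describes the Chain-of-Thought baseline only at a high level, so the following is a minimal sketch of one way such a two-stage pipeline could be wired, not the authors' implementation. It assumes a multi-image comprehension step that returns a target image index plus a textual referring expression, and a single-image grounding step that returns a box. The names `comprehend`, `ground_single`, `GroundingResult`, and the `(x1, y1, x2, y2)` box convention are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in normalized coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class GroundingResult:
    image_index: int  # which input image contains the target
    box: Box          # predicted box inside that image


def cot_multi_image_grounding(
    images: List[str],
    query: str,
    comprehend: Callable[[List[str], str], Tuple[int, str]],
    ground_single: Callable[[str, str], Box],
) -> GroundingResult:
    """Two-stage (non-end-to-end) grounding over several images."""
    # Stage 1: multi-image comprehension. The model reads all images and the
    # free-form query, then names the target image and describes the target
    # in plain text.
    image_index, expression = comprehend(images, query)
    if not 0 <= image_index < len(images):
        raise ValueError(f"image index {image_index} is out of range")

    # Stage 2: single-image grounding. Only the text expression crosses the
    # stage boundary, so visual cues that are hard to verbalize are lost here.
    box = ground_single(images[image_index], expression)
    return GroundingResult(image_index=image_index, box=box)


# Stub models, for illustration only; a real system would call an MLLM here.
def fake_comprehend(images: List[str], query: str) -> Tuple[int, str]:
    return 1, "the red car on the left"


def fake_ground_single(image: str, expression: str) -> Box:
    return (0.1, 0.2, 0.5, 0.6)


result = cot_multi_image_grounding(
    ["street_a.jpg", "street_b.jpg"],
    "Find the car that also appears in the first image.",
    fake_comprehend,
    fake_ground_single,
)
print(result)
```

In this sketch the intermediate text expression is the only information passed between the two stages, which is one plausible reading of why the abstract calls the approach unstable and weak on abstract visual information. An end-to-end model such as Migician would avoid that hand-off.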