Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
January 10, 2025
Authors: You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun
cs.AI
Abstract
The recent advancement of Multimodal Large Language Models (MLLMs) has
significantly improved their fine-grained perception of single images and
general comprehension across multiple images. However, existing MLLMs still
face challenges in achieving precise grounding in complex multi-image
scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework
that integrates single-image grounding with multi-image comprehension. While
partially effective, it remains unstable and struggles to capture abstract
visual information due to its non-end-to-end nature. Therefore, we introduce
Migician, the first multi-image grounding model capable of performing free-form
and accurate grounding across multiple images. To support this, we present the
MGrounding-630k dataset, which comprises data for several multi-image grounding
tasks derived from existing datasets, along with newly generated free-form
grounding instruction-following data. Furthermore, we propose MIG-Bench, a
comprehensive benchmark specifically designed for evaluating multi-image
grounding capabilities. Experimental results demonstrate that our model
achieves significantly superior multi-image grounding capabilities,
outperforming the best existing MLLMs by 21.61% and even surpassing much larger
70B models. Our code, model, dataset, and benchmark are fully open-sourced.
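To make the two-stage CoT baseline described above concrete, here is a minimal Python sketch: stage one uses multi-image comprehension to pick the relevant image, stage two runs single-image grounding on it. The `query_mllm` function, the prompts, and the reply parsing are illustrative assumptions, not the paper's implementation; the brittle hand-off between stages is one way to see why the abstract calls this approach non-end-to-end and unstable.

```python
def query_mllm(images: list[str], prompt: str) -> str:
    """Hypothetical stand-in for any MLLM inference call: takes image
    paths plus a text prompt and returns the model's text response."""
    raise NotImplementedError("plug in your own MLLM client here")


def cot_multi_image_grounding(images: list[str], query: str) -> str:
    # Stage 1: multi-image comprehension -- ask which image contains the target.
    index_reply = query_mllm(
        images,
        f"Which of these {len(images)} images contains: {query}? "
        "Reply with the image number only.",
    )
    # Parse a 1-based image number out of the free-text reply; this fragile
    # parsing step is exactly the kind of failure point a single end-to-end
    # model like Migician avoids.
    idx = int("".join(ch for ch in index_reply if ch.isdigit())) - 1

    # Stage 2: single-image grounding -- ask for a bounding box in that image.
    return query_mllm(
        [images[idx]],
        f"Locate {query} in this image and return its bounding box "
        "as [x1, y1, x2, y2].",
    )
```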