Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
January 10, 2025
Authors: You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun
cs.AI
Abstract
The recent advancement of Multimodal Large Language Models (MLLMs) has
significantly improved their fine-grained perception of single images and
general comprehension across multiple images. However, existing MLLMs still
face challenges in achieving precise grounding in complex multi-image
scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework
that integrates single-image grounding with multi-image comprehension. While
partially effective, it remains unstable and struggles to capture abstract
visual information due to its non-end-to-end nature. Therefore, we introduce
Migician, the first multi-image grounding model capable of performing free-form
and accurate grounding across multiple images. To support this, we present the
MGrounding-630k dataset, which comprises data for several multi-image grounding
tasks derived from existing datasets, along with newly generated free-form
grounding instruction-following data. Furthermore, we propose MIG-Bench, a
comprehensive benchmark specifically designed for evaluating multi-image
grounding capabilities. Experimental results demonstrate that our model
achieves significantly superior multi-image grounding capabilities,
outperforming the best existing MLLMs by 21.61% and even surpassing much larger
70B models. Our code, model, dataset, and benchmark are fully open-sourced.
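To make the two-stage CoT baseline described above concrete, here is a minimal Python sketch: stage one uses multi-image comprehension to pick the relevant image, stage two runs single-image grounding on it. The `query_mllm` function, the prompts, and the reply parsing are illustrative assumptions, not the paper's implementation; the brittle hand-off between stages is one way to see why the abstract calls this approach non-end-to-end and unstable.

```python
def query_mllm(images: list[str], prompt: str) -> str:
    """Hypothetical stand-in for any MLLM inference call: takes image
    paths plus a text prompt and returns the model's text response."""
    raise NotImplementedError("plug in your own MLLM client here")


def cot_multi_image_grounding(images: list[str], query: str) -> str:
    # Stage 1: multi-image comprehension -- ask which image contains the target.
    index_reply = query_mllm(
        images,
        f"Which of these {len(images)} images contains: {query}? "
        "Reply with the image number only.",
    )
    # Parse a 1-based image number out of the free-text reply; this fragile
    # parsing step is exactly the kind of failure point a single end-to-end
    # model like Migician avoids.
    idx = int("".join(ch for ch in index_reply if ch.isdigit())) - 1

    # Stage 2: single-image grounding -- ask for a bounding box in that image.
    return query_mllm(
        [images[idx]],
        f"Locate {query} in this image and return its bounding box "
        "as [x1, y1, x2, y2].",
    )
```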