Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

January 10, 2025
Authors: You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun
cs.AI

Abstract

The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced.
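To make the Chain-of-Thought baseline mentioned in the abstract more concrete, the sketch below illustrates one plausible way such a two-stage pipeline could be wired together: a multi-image reasoning step first picks the target image and produces a textual referring expression, and a single-image grounder then localizes it. The function names, return formats, and stub outputs here are hypothetical placeholders for illustration, not the paper's actual interface.

```python
# Minimal sketch of a two-stage CoT multi-image grounding pipeline (hypothetical interface).
from typing import List, Tuple


def multi_image_reasoner(images: List[str], query: str) -> Tuple[int, str]:
    """Stage 1 (hypothetical): an MLLM reads all images plus the free-form query
    and returns the index of the target image and a textual referring expression."""
    # Stub: a real system would call an MLLM here; we simply pick image 0.
    return 0, f"the object referred to by: {query}"


def single_image_grounder(image: str, referring_expression: str) -> List[float]:
    """Stage 2 (hypothetical): a single-image grounding model localizes the
    referring expression and returns a normalized [x1, y1, x2, y2] box."""
    # Stub bounding box; a real grounder would predict this from the image pixels.
    return [0.25, 0.30, 0.60, 0.85]


def cot_multi_image_grounding(images: List[str], query: str) -> Tuple[int, List[float]]:
    """Chains the two stages. Because they communicate only through free text
    (non-end-to-end), a vague or wrong expression from stage 1 cannot be
    corrected in stage 2 -- the instability the abstract points out."""
    target_idx, expression = multi_image_reasoner(images, query)
    box = single_image_grounder(images[target_idx], expression)
    return target_idx, box


if __name__ == "__main__":
    idx, box = cot_multi_image_grounding(
        ["img_a.jpg", "img_b.jpg"],
        "the bag held by the woman in the first image",
    )
    print(f"target image {idx}, box {box}")
```

By contrast, the abstract describes Migician as an end-to-end model trained on MGrounding-630k, so image selection and localization are produced jointly rather than being chained through intermediate text.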
