Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved their fine-grained perception of single images and their general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information because it is not end-to-end. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced.
