Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

December 4, 2024
Authors: Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements we are most interested in. Encouragingly, existing work finds that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance. Building on this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
