Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

December 4, 2024
Authors: Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial because it focuses on the specific elements we are most interested in. Encouragingly, existing work finds that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance. Building on this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
