Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
March 10, 2025
Authors: Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu
cs.AI
Abstract
Multimodal large language models (MLLMs), built on large-scale pre-trained
vision towers and language models, have shown great capabilities in multimodal
understanding. However, most existing MLLMs are trained on single-turn visual
question-answering tasks, which do not accurately reflect real-world human
conversations. In this paper, we introduce MMDiag, a multi-turn multimodal
dialogue dataset. This dataset is collaboratively generated through
deliberately designed rules and GPT assistance, and features strong
correlations between questions, between questions and images, and among
different image regions, thus aligning more closely with real-world scenarios.
MMDiag serves as a strong benchmark for multi-turn multimodal dialogue
learning and poses greater challenges to the grounding and reasoning
capabilities of MLLMs. Further, inspired by human visual processing, we
present DiagNote, an MLLM equipped with multimodal grounding and reasoning
capabilities. DiagNote consists of two interacting modules, Deliberate and
Gaze, which perform Chain-of-Thought reasoning and annotation, respectively,
throughout multi-turn dialogues. We empirically demonstrate DiagNote's
advantages over existing MLLMs, both in grounding and in jointly processing
and reasoning over visual and language information.
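
To make the dataset's "strong correlations" concrete, here is a minimal sketch of what a single MMDiag-style sample could look like: each follow-up question refers back to an earlier turn and is grounded in a different image region. All field names, coordinates, and dialogue content below are hypothetical illustrations; the abstract does not specify the actual schema.

```python
# Hypothetical MMDiag-style sample: a multi-turn dialogue over one image,
# where later questions depend on earlier answers and on distinct regions.
sample = {
    "image": "kitchen_001.jpg",
    "dialogue": [
        {
            "turn": 1,
            "question": "What is on the counter next to the stove?",
            "region": [412, 130, 560, 245],  # assumed bounding box the answer is grounded in
            "answer": "A cutting board with sliced vegetables.",
        },
        {
            "turn": 2,
            # Follow-up correlated with turn 1 and grounded in a different region.
            "question": "Are those vegetables the same kind as the ones in the bowl on the left?",
            "region": [60, 300, 190, 410],
            "answer": "No, the bowl on the left holds tomatoes, while the board has peppers.",
        },
    ],
}
```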
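The abstract describes Deliberate and Gaze as two modules that interact across turns, one producing Chain-of-Thought reasoning and the other producing annotations. The sketch below shows one plausible way such a per-turn interaction loop could be wired together; the interfaces (`deliberate.reason`, `gaze.annotate`, `deliberate.answer`) are assumptions for illustration, not the paper's actual API.

```python
# A minimal sketch, assuming a DiagNote-style loop in which a reasoning
# module ("Deliberate") and a grounding module ("Gaze") exchange
# intermediate results on every dialogue turn.
def answer_dialogue(image, questions, deliberate, gaze):
    """Run a multi-turn dialogue, letting the two modules interact per turn."""
    history = []  # accumulated turns: (question, thought, boxes, answer)
    for question in questions:
        # Deliberate drafts a chain of thought conditioned on the dialogue so far.
        thought = deliberate.reason(image, question, history)
        # Gaze annotates the image regions the current reasoning refers to.
        boxes = gaze.annotate(image, thought)
        # Deliberate refines its reasoning with the grounded regions and answers.
        answer = deliberate.answer(image, question, thought, boxes, history)
        history.append((question, thought, boxes, answer))
    return history
```

Passing the two modules in as parameters keeps the sketch self-contained while making the key design point explicit: grounding feeds back into reasoning within each turn, rather than being a one-shot post-processing step.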