
Towards Visual Text Grounding of Multimodal Large Language Model

April 7, 2025
Authors: Ming Li, Ruiyi Zhang, Jian Chen, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, Tong Sun
cs.AI

Abstract

Despite rapid progress in Multimodal Large Language Models (MLLMs), a non-negligible limitation remains in their ability to perform visual text grounding, especially in text-rich document images. Document images, such as scanned forms and infographics, pose critical challenges due to their complex layouts and dense textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding in natural images rather than text-rich document images. To bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90k synthetic examples based on four diverse datasets. A comprehensive evaluation of various MLLMs on our benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods, based on general instruction tuning and on a plug-and-play efficient embedding, respectively. Finetuning MLLMs on our synthetic dataset yields promising improvements in their spatial reasoning and grounding capabilities.
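The abstract does not spell out how grounding quality is scored. As a rough, hypothetical sketch (not the authors' implementation), a model's predicted evidence boxes for a document QA pair could be compared against human-annotated boxes using IoU-based matching; all function names and the 0.5 threshold below are illustrative assumptions.

```python
# Hypothetical sketch (not the TRIG authors' code): score predicted grounding
# boxes against annotated evidence boxes with IoU matching.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in page coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_score(predicted: List[Box], annotated: List[Box], thr: float = 0.5) -> float:
    """Fraction of annotated evidence boxes matched by some prediction at IoU >= thr."""
    if not annotated:
        return 0.0
    hits = sum(1 for gt in annotated if any(iou(p, gt) >= thr for p in predicted))
    return hits / len(annotated)


# Example: the prediction covers one of two annotated evidence regions -> score 0.5.
pred = [(100, 200, 300, 240)]
gold = [(105, 198, 295, 238), (400, 500, 600, 540)]
print(grounding_score(pred, gold))
```

Averaging such a score over the 800 benchmark QA pairs would give one plausible aggregate grounding metric; the paper itself should be consulted for the exact evaluation protocol.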
