Towards Visual Text Grounding of Multimodal Large Language Model
April 7, 2025
作者: Ming Li, Ruiyi Zhang, Jian Chen, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, Tong Sun
cs.AI
Abstract
Despite the rapid evolution of Multimodal Large Language Models (MLLMs), a
non-negligible limitation remains in their struggle with visual text
grounding, especially in text-rich images of documents. Document images, such
as scanned forms and infographics, highlight critical challenges due to their
complex layouts and textual content. However, current benchmarks do not fully
address these challenges, as they mostly focus on visual grounding on natural
images, rather than text-rich document images. Thus, to bridge this gap, we
introduce TRIG, a novel task with a newly designed instruction dataset for
benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs
in document question-answering. Specifically, we propose an OCR-LLM-human
interaction pipeline to create 800 manually annotated question-answer pairs as
a benchmark and a large-scale training set of 90K synthetic data based on four
diverse datasets. A comprehensive evaluation of various MLLMs on our proposed
benchmark exposes substantial limitations in their grounding capability on
text-rich images. In addition, we propose two simple and effective TRIG methods
based on general instruction tuning and plug-and-play efficient embedding,
respectively. Finetuning MLLMs on our synthetic dataset yields promising
improvements in their spatial reasoning and grounding capabilities.
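The OCR-LLM-human interaction pipeline mentioned in the abstract can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function names (`run_ocr`, `generate_qa`, `human_review`), the `OCRBox` structure, and the rule-based QA step standing in for the LLM are all hypothetical placeholders; a real pipeline would call an actual OCR engine and prompt an LLM with the extracted text and layout.

```python
from dataclasses import dataclass

@dataclass
class OCRBox:
    """A piece of recognized text with its bounding box (hypothetical structure)."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in pixel coordinates

def run_ocr(image_path):
    # Placeholder for the OCR stage: a real pipeline would run an OCR engine
    # on the document image and return word- or line-level boxes.
    return [
        OCRBox("Invoice No: 42", (10, 10, 200, 30)),
        OCRBox("Total: $99.00", (10, 40, 180, 60)),
    ]

def generate_qa(boxes):
    # Placeholder for the LLM stage: propose a QA pair grounded in one OCR box.
    # Here a trivial "key: value" rule stands in for prompting an LLM.
    qa_pairs = []
    for box in boxes:
        if ":" in box.text:
            key, value = box.text.split(":", 1)
            qa_pairs.append({
                "question": f"What is the {key.strip().lower()}?",
                "answer": value.strip(),
                "grounding_bbox": box.bbox,  # links the answer to image evidence
            })
    return qa_pairs

def human_review(qa_pairs):
    # Placeholder for the human stage: annotators would verify each pair;
    # here we only drop pairs with empty answers.
    return [qa for qa in qa_pairs if qa["answer"]]

pairs = human_review(generate_qa(run_ocr("form.png")))
```

Each retained pair carries both the textual answer and a grounding bounding box, which is the essential structure a text-rich image grounding benchmark needs.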