LEOPARD: A Vision Language Model for Text-Rich Multi-Image Tasks
October 2, 2024
Authors: Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Meng Jiang, Dong Yu
cs.AI
Abstract
Text-rich images, where text serves as the central visual element guiding the
overall understanding, are prevalent in real-world applications, such as
presentation slides, scanned documents, and webpage snapshots. Tasks involving
multiple text-rich images are especially challenging, as they require not only
understanding the content of individual images but also reasoning about
inter-relationships and logical flows across multiple visual inputs. Despite
the importance of these scenarios, current multimodal large language models
(MLLMs) struggle to handle such tasks due to two key challenges: (1) the
scarcity of high-quality instruction tuning datasets for text-rich multi-image
scenarios, and (2) the difficulty in balancing image resolution with visual
feature sequence length. To address these challenges, we propose LEOPARD, an
MLLM designed specifically for handling vision-language tasks involving
multiple text-rich images. First, we curated about one million high-quality
multimodal instruction-tuning instances tailored to text-rich, multi-image
scenarios. Second, we developed an adaptive high-resolution multi-image
encoding module to dynamically optimize the allocation of visual sequence
length based on the original aspect ratios and resolutions of the input images.
Experiments across a wide range of benchmarks demonstrate our model's superior
capabilities in text-rich, multi-image evaluations and competitive performance
in general domain evaluations.
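
As a rough illustration of the second contribution, the sketch below shows one way a fixed budget of visual tiles could be split across several images according to their original resolutions and aspect ratios. It is not the paper's implementation: the 336-pixel tile size, the budget values, and the function names `ideal_grid`, `fit_grid`, and `allocate_tiles` are all assumptions made for this example.

```python
import math


def ideal_grid(width, height, tile=336):
    """Grid (cols, rows) that covers the image at full resolution.

    The tile size of 336 pixels is an assumed value for illustration.
    """
    return math.ceil(width / tile), math.ceil(height / tile)


def fit_grid(width, height, max_tiles):
    """Pick a grid with cols * rows <= max_tiles.

    Among the allowed grids, choose the one whose aspect ratio is closest
    to the image's (compared on a log scale); ties go to the larger grid.
    """
    target = math.log(width / height)
    best, best_key = (1, 1), None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            key = (abs(math.log(cols / rows) - target), -(cols * rows))
            if best_key is None or key < best_key:
                best, best_key = (cols, rows), key
    return best


def allocate_tiles(sizes, budget, tile=336):
    """Assign each image a tile grid so the total stays within `budget`.

    `sizes` is a list of (width, height) pairs. Every image keeps at least
    one tile; the rest of the budget is shared in proportion to how many
    extra tiles each image would need at full resolution.
    """
    count = len(sizes)
    if budget < count:
        raise ValueError("budget must allow at least one tile per image")

    ideal = [ideal_grid(w, h, tile) for w, h in sizes]
    wanted = [cols * rows for cols, rows in ideal]
    total = sum(wanted)
    if total <= budget:
        # Everything fits, so no image needs to be downscaled.
        return ideal

    spare = budget - count   # tiles left after reserving one per image
    extra = total - count    # extra tiles requested across all images
    grids = []
    for (w, h), n in zip(sizes, wanted):
        share = 1 + ((n - 1) * spare) // extra
        grids.append(fit_grid(w, h, share))
    return grids
```

For instance, with a budget of 5 tiles, `allocate_tiles([(1344, 1344), (336, 336)], budget=5)` assigns a 2x2 grid to the large square image and a 1x1 grid to the small one. Because `fit_grid` favors matching the aspect ratio over using every allotted tile, an image may end up with fewer tiles than its share; a real system might trade these off differently.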