VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
December 1, 2024
Authors: Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang
cs.AI
Abstract
Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). While further analysis is essential, existing datasets for evaluating the visual perception of LVLMs are insufficient. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the perception of fine-grained visual information by LVLMs, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions spanning 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) the 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, perform poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential to enhance the visual perception of LVLMs, but the observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
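
As a rough, illustrative sketch of the multiple-choice evaluation protocol described in the abstract (not the authors' released code), the snippet below computes per-task accuracy from model predictions. The record fields `task`, `answer`, and `prediction`, as well as the task names, are hypothetical placeholders rather than the dataset's actual schema.

```python
from collections import defaultdict

def per_task_accuracy(records):
    """Group multiple-choice predictions by task and report accuracy for each task."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["prediction"] == r["answer"])
    return {task: correct[task] / total[task] for task in total}

# Hypothetical example records; field and task names are illustrative only.
examples = [
    {"task": "geometry_angle", "answer": "B", "prediction": "B"},
    {"task": "geometry_angle", "answer": "C", "prediction": "A"},
    {"task": "chart_intersection", "answer": "A", "prediction": "A"},
]

print(per_task_accuracy(examples))  # e.g., {'geometry_angle': 0.5, 'chart_intersection': 1.0}
```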