VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

December 1, 2024
Authors: Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang
cs.AI

Abstract

Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). While further analysis is essential, there is a shortage of datasets for evaluating the visual perception of LVLMs. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the visual perception of LVLMs for fine-grained visual information, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions across 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) the 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, perform poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential to enhance the visual perception of LVLMs, but the observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
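
As a rough illustration of the evaluation protocol described in the abstract (scoring an LVLM on multiple-choice questions about figures), the following is a minimal sketch. The file name, the record fields (`image`, `question`, `options`, `answer`), and the `ask_model` stub are hypothetical placeholders for illustration, not the actual VisOnlyQA data format or evaluation code; see the repository for the real setup.

```python
# Minimal sketch: accuracy of an LVLM on multiple-choice, figure-based questions.
# All field names and the data path below are assumptions, not the VisOnlyQA schema.
import json
from pathlib import Path


def ask_model(image_path: str, prompt: str) -> str:
    """Placeholder for an LVLM call (API or local model).
    Replace with a real implementation; this stub always abstains."""
    return ""


def evaluate(jsonl_path: str) -> float:
    correct, total = 0, 0
    for line in Path(jsonl_path).read_text().splitlines():
        ex = json.loads(line)
        # Present the options as labeled choices and ask for a single letter.
        labels = [chr(ord("A") + i) for i in range(len(ex["options"]))]
        choices = "\n".join(f"{l}. {o}" for l, o in zip(labels, ex["options"]))
        prompt = f"{ex['question']}\n{choices}\nAnswer with the letter only."
        pred = ask_model(ex["image"], prompt).strip().upper()[:1]
        correct += int(pred == ex["answer"])
        total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    print(f"accuracy: {evaluate('visonlyqa_eval.jsonl'):.3f}")
```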
