

VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

December 1, 2024
Authors: Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang
cs.AI

Abstract

Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). While further analysis is essential, there is a deficiency in datasets for evaluating the visual perception of LVLMs. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the visual perception of LVLMs for fine-grained visual information, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions in 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, work poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential for enhancing the visual perception of LVLMs, but observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
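To make the evaluation setup concrete, below is a minimal sketch of how one might score an LVLM on multiple-choice visual-perception questions of the kind VisOnlyQA contains. The file name, JSON field names (image, question, options, answer), and the stand-in answer_question function are assumptions for illustration, not the repository's actual data schema or API; see https://github.com/psunlpgroup/VisOnlyQA for the real loading and evaluation code.

```python
import json
import random
from pathlib import Path

# Hypothetical file and field names; the actual VisOnlyQA schema may differ.
DATA_PATH = Path("visonlyqa_eval.jsonl")


def load_examples(path: Path):
    """Yield one multiple-choice example per JSON line."""
    with path.open() as f:
        for line in f:
            yield json.loads(line)


def answer_question(image_path: str, question: str, options: list[str]) -> str:
    """Stand-in for an LVLM call; here we simply guess among the options."""
    return random.choice(options)


def evaluate(path: Path) -> float:
    """Compute multiple-choice accuracy over the evaluation file."""
    correct = total = 0
    for ex in load_examples(path):
        pred = answer_question(ex["image"], ex["question"], ex["options"])
        correct += int(pred == ex["answer"])
        total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    print(f"Accuracy: {evaluate(DATA_PATH):.3f}")
```

Replacing the random guess with a real model call reproduces the paper's setting of exact-match accuracy on single-answer multiple-choice questions, which is why near-perfect human performance serves as the reference point.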

