VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Abstract

Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). Although further analysis is essential, datasets for evaluating the visual perception of LVLMs are lacking. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the visual perception of LVLMs for fine-grained visual information, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions in 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) The 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, perform poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential for enhancing the visual perception of LVLMs, but the observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
