VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

March 9, 2025
Authors: Yanling Wang, Yihan Zhao, Xiaodong Chen, Shasha Guo, Lixin Liu, Haoyang Li, Yong Xiao, Jing Zhang, Qi Li, Ke Xu
cs.AI

Abstract

Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
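Since the dataset is hosted on the Hugging Face Hub, a natural first step is to pull it down with the `datasets` library. The snippet below is a minimal sketch: the repository id comes from the URL above, but the split and field names are not stated in the abstract, so the code simply loads whatever splits exist and prints one record to reveal the actual schema.

```python
from datasets import load_dataset

# Repository id taken from the dataset URL in the abstract.
# Split names are not stated there, so load everything and inspect.
dataset = load_dataset("WYLing/VisualSimpleQA")
print(dataset)  # shows the available splits and their column names

# Peek at one record from the first split; field names such as
# "question" or "answer" are assumptions to be checked against the output.
first_split = next(iter(dataset.values()))
print(first_split[0])
```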
