

VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

February 17, 2025
Authors: Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. Fung
cs.AI

Abstract

Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Despite the extensive knowledge that vision-language models (VLMs) possess, it remains largely unexplored whether they are capable of performing this fundamental task. To address this, we introduce VLM^2-Bench, a benchmark designed to assess whether VLMs can Visually Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap where even GPT-4o lags 34.80% behind humans. Based on these insights, we advocate for (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models' ability to independently structure and infer relationships among visual cues.
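The abstract describes an evaluation over 9 subtasks and 3,000+ test cases but does not include the benchmark's actual harness. Below is a minimal, hypothetical Python sketch of how exact-match scoring over such visual-linking test cases might look; the `TestCase` structure, `evaluate` function, and stub model are illustrative assumptions, not the paper's data format or metric.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TestCase:
    """One hypothetical benchmark item: image paths, a question, and a reference answer."""
    images: List[str]
    question: str
    answer: str

def evaluate(model_fn: Callable[[List[str], str], str], cases: List[TestCase]) -> float:
    """Return simple exact-match accuracy of a VLM over the test cases."""
    correct = 0
    for case in cases:
        prediction = model_fn(case.images, case.question)
        if prediction.strip().lower() == case.answer.strip().lower():
            correct += 1
    return correct / len(cases) if cases else 0.0

if __name__ == "__main__":
    # Toy cases in the spirit of "is this the same person across photos?";
    # replace the stub lambda with a real VLM call to run an actual evaluation.
    cases = [
        TestCase(["img_a.jpg", "img_b.jpg"], "Do these photos show the same person?", "yes"),
        TestCase(["img_c.jpg", "img_d.jpg"], "Do these photos show the same person?", "no"),
    ]
    stub_model = lambda images, question: "yes"
    print(f"accuracy = {evaluate(stub_model, cases):.2f}")
```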

