NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
October 18, 2024
Authors: Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan
cs.AI
Abstract
Vision-language models (VLMs) have made significant progress in recent
visual-question-answering (VQA) benchmarks that evaluate complex
visio-linguistic reasoning. However, are these models truly effective? In this
work, we show that VLMs still struggle with natural images and questions that
humans can easily answer, which we term natural adversarial samples. We also
find it surprisingly easy to generate these VQA samples from natural image-text
corpora using off-the-shelf models like CLIP and ChatGPT. We propose a
semi-automated approach to collect a new benchmark, NaturalBench, for reliably
evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a
vision-centric design by pairing each question with two images that
yield different answers, preventing blind solutions from answering without
using the images. This makes NaturalBench more challenging than previous
benchmarks that can be solved with commonsense priors. We evaluate 53
state-of-the-art VLMs on NaturalBench, showing that models like
LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o
lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is
hard from two angles: (1) Compositionality: Solving NaturalBench requires
diverse visio-linguistic skills, including understanding attribute bindings,
object relationships, and advanced reasoning like logic and counting. To this
end, unlike prior work that uses a single tag per sample, we tag each
NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2)
Biases: NaturalBench exposes severe biases in VLMs, as models often choose the
same answer regardless of the image. Lastly, we apply our benchmark curation
method to diverse data sources, including long captions (over 100 words) and
non-English languages like Chinese and Hindi, highlighting its potential for
dynamic evaluations of VLMs.
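To make the vision-centric pairing concrete, below is a minimal, hypothetical scoring sketch in Python (not the official NaturalBench evaluation code). The PairedSample record, the vlm(image, question) callable, and the exact-match comparison are illustrative assumptions: the same question is asked about both of its paired images, full credit requires getting both answers right, and returning the same answer for both images is flagged as an image-blind response.

```python
# Hypothetical sketch of scoring one NaturalBench-style paired sample.
# Assumptions (not from the paper): a PairedSample record, a vlm(image,
# question) callable returning a short answer string, and exact-match
# comparison against the ground-truth answers.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PairedSample:
    question: str   # the same question is asked about both images
    image_1: str    # identifier/path of the first image
    image_2: str    # identifier/path of the second image
    answer_1: str   # ground-truth answer for (question, image_1)
    answer_2: str   # ground-truth answer for (question, image_2); differs from answer_1


def score_sample(vlm: Callable[[str, str], str], sample: PairedSample) -> Dict[str, float]:
    """Ask the VLM the question about each image and score both predictions."""
    pred_1 = vlm(sample.image_1, sample.question).strip().lower()
    pred_2 = vlm(sample.image_2, sample.question).strip().lower()
    correct_1 = pred_1 == sample.answer_1.strip().lower()
    correct_2 = pred_2 == sample.answer_2.strip().lower()
    return {
        "mean_acc": (correct_1 + correct_2) / 2,       # partial credit per image
        "paired_acc": float(correct_1 and correct_2),  # both answers must be correct
        "image_blind": float(pred_1 == pred_2),        # same answer for both images
    }
```

Because the two ground-truth answers differ by construction, a model that ignores the images and answers both copies identically can score at most 0.5 on mean_acc and 0 on paired_acc under this sketch, which illustrates why commonsense-only "blind" solutions cannot solve the benchmark.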