NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Abstract

Vision-language models (VLMs) have made significant progress on recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, which prevents blind solutions from answering without using the images. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench, showing that models like LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs.
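The effect of pairing each question with two images that yield different answers can be illustrated with a minimal sketch. It is not the benchmark's official scoring code: the metric names `per_image_accuracy` and `paired_accuracy`, the `Model` interface, and the toy samples are all assumptions made for illustration. The sketch shows that a "blind" model that always gives the same answer can look acceptable when each image is scored on its own, yet scores zero once both images of a pair must be answered correctly.

```python
from typing import Callable, List, Tuple

# Hypothetical sample layout: (question, (image_a, image_b), (answer_a, answer_b)).
# By design the two ground-truth answers differ.
Sample = Tuple[str, Tuple[str, str], Tuple[str, str]]
# Hypothetical model interface: (question, image_id) -> answer string.
Model = Callable[[str, str], str]


def per_image_accuracy(model: Model, samples: List[Sample]) -> float:
    """Fraction of individual (question, image) pairs answered correctly."""
    total = 0
    correct = 0
    for question, images, answers in samples:
        for image, answer in zip(images, answers):
            total += 1
            correct += model(question, image) == answer
    return correct / total if total else 0.0


def paired_accuracy(model: Model, samples: List[Sample]) -> float:
    """Fraction of questions answered correctly on BOTH paired images."""
    if not samples:
        return 0.0
    hits = sum(
        all(model(question, image) == answer
            for image, answer in zip(images, answers))
        for question, images, answers in samples
    )
    return hits / len(samples)


def blind_model(question: str, image: str) -> str:
    """Ignores the image entirely and always answers 'yes'."""
    return "yes"


# Toy data (made up for this sketch, not taken from the benchmark).
SAMPLES: List[Sample] = [
    ("Is the dog asleep?", ("dog_1.jpg", "dog_2.jpg"), ("yes", "no")),
    ("Is the cup left of the plate?", ("table_1.jpg", "table_2.jpg"), ("no", "yes")),
]

if __name__ == "__main__":
    print(per_image_accuracy(blind_model, SAMPLES))  # 0.5
    print(paired_accuracy(blind_model, SAMPLES))     # 0.0
```

Because the two answers in a pair always differ, any model that returns the same answer for both images gets at most one of them right, so its paired score is zero however well a language prior fits the question.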
