Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
January 6, 2025
Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy
cs.AI
Abstract
The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the cost of question creation. Our experiments demonstrate that AutoConverter generates correct and challenging multiple-choice questions, with VLMs achieving consistently similar or lower accuracy on these questions than on human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
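
To make the conversion concrete, the following is a minimal sketch of an agent-style open-ended-to-multiple-choice conversion loop in the spirit of AutoConverter. The `llm` callable, the prompts, and the proposer/reviewer roles here are illustrative assumptions, not the paper's actual implementation.

# Illustrative sketch only: the prompts, the `llm` callable, and the
# proposer/reviewer loop are assumptions, not AutoConverter's published code.
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCQ:
    question: str
    choices: List[str]  # shuffled options, including the correct answer
    answer: str         # the correct option

def convert_to_mcq(
    question: str,
    answer: str,
    llm: Callable[[str], str],  # hypothetical text-in, text-out model call
    num_distractors: int = 3,
    max_rounds: int = 3,
) -> MCQ:
    """Propose distractors, have a reviewer critique them, and refine."""
    distractors: List[str] = []
    for _ in range(max_rounds):
        # Proposer: draft plausible-but-incorrect options.
        proposal = llm(
            f"Question: {question}\nCorrect answer: {answer}\n"
            f"Write {num_distractors} plausible but incorrect options, one per line."
        )
        distractors = [d.strip() for d in proposal.splitlines() if d.strip()]
        distractors = distractors[:num_distractors]
        # Reviewer: reject sets with correct, trivial, or duplicated options.
        verdict = llm(
            f"Question: {question}\nCorrect answer: {answer}\n"
            f"Candidate options: {distractors}\n"
            "Reply OK if every option is incorrect, distinct, and non-trivial; "
            "otherwise reply REVISE."
        )
        if verdict.strip().upper().startswith("OK"):
            break
    choices = distractors + [answer]
    random.shuffle(choices)  # avoid positional bias during evaluation
    return MCQ(question=question, choices=choices, answer=answer)

In this sketch, a proposer drafts plausible distractors and a reviewer rejects option sets containing correct, trivial, or duplicated entries; this propose-and-critique loop is one simple way to push converted questions toward the "correct and challenging" property the paper targets.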