Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
January 6, 2025
Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy
cs.AI
Abstract
The rapid development of vision language models (VLMs) demands rigorous and
reliable evaluation. However, current visual question answering (VQA)
benchmarks often depend on open-ended questions, making accurate evaluation
difficult due to the variability in natural language responses. To address
this, we introduce AutoConverter, an agentic framework that automatically
converts these open-ended questions into multiple-choice format, enabling
objective evaluation while reducing the costly question creation process. Our
experiments demonstrate that AutoConverter can generate correct and challenging
multiple-choice questions, with VLMs consistently achieving similar or lower
accuracy on these questions than on human-created ones. Using
AutoConverter, we construct VMCBench, a benchmark created by transforming 20
existing VQA datasets into a unified multiple-choice format, totaling 9,018
questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench,
setting a new standard for scalable, consistent, and reproducible VLM
evaluation.
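The core format conversion can be illustrated with a toy sketch. This is not AutoConverter itself (the paper describes an agentic framework in which LLM agents propose and refine distractors); here the distractors are simply given as input, and the sketch only shows how an open-ended QA pair plus distractors becomes a gradable multiple-choice item with a deterministic answer key. All names (`OpenEndedQA`, `to_multiple_choice`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OpenEndedQA:
    """An open-ended VQA item: a free-form question with a reference answer."""
    question: str
    answer: str


def to_multiple_choice(qa: OpenEndedQA, distractors: list[str]) -> tuple[str, str]:
    """Format an open-ended item as a four-option multiple-choice question.

    Options are sorted alphabetically so the position of the correct answer
    is deterministic. Returns the rendered prompt and the correct letter,
    which lets an evaluator grade a model's response by exact letter match
    instead of free-form string comparison.
    """
    options = sorted([qa.answer] + distractors)
    letters = "ABCD"
    lines = [qa.question] + [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    correct_letter = letters[options.index(qa.answer)]
    return "\n".join(lines), correct_letter
```

Grading then reduces to comparing a single predicted letter against the key, which is the objectivity gain the abstract describes over open-ended answer matching.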