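Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Abstract

The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, which are difficult to score accurately because natural language responses vary widely. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the cost of question creation. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions: VLMs achieve consistently similar or lower accuracy on these questions than on human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.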

