Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Abstract

The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, which makes accurate evaluation difficult because natural language responses vary. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the cost of question creation. Our experiments show that AutoConverter generates correct and challenging multiple-choice questions: VLMs achieve consistently similar or lower accuracy on them than on human-created questions. Using AutoConverter, we construct VMCBench, a benchmark that transforms 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
