

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

September 14, 2024
作者: Neelabh Sinha, Vinija Jain, Aman Chadha
cs.AI

Abstract

Visual Question-Answering (VQA) has become a key use-case in several applications to aid user experience, particularly after Vision-Language Models (VLMs) achieved good results in zero-shot inference. But evaluating different VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts, while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.

