Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Abstract

Visual Question-Answering (VQA) has become a key use case in several applications to aid user experience, particularly after Vision-Language Models (VLMs) achieved good results in zero-shot inference. However, evaluating different VLMs against an application's requirements using a standardized framework in practical settings remains challenging. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, which achieves a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.
