WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation

Abstract

Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide, necessitating robust benchmarks to ensure their safety, efficacy, and fairness. Multiple-choice question and answer (QA) datasets derived from national medical examinations have long served as valuable evaluation tools, but existing datasets are largely text-only and available in only a limited subset of languages and countries. To address these challenges, we present WorldMedQA-V, an updated multilingual, multimodal benchmarking dataset designed to evaluate VLMs in healthcare. WorldMedQA-V includes 568 labeled multiple-choice QAs paired with 568 medical images from four countries (Brazil, Israel, Japan, and Spain), each provided in its original language and in an English translation validated by native clinicians. Baseline performance for common open- and closed-source models is provided in the local language and in English translation, both with and without images provided to the model. The WorldMedQA-V benchmark aims to better match AI systems to the diverse healthcare environments in which they are deployed, fostering more equitable, effective, and representative applications.
