MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
January 30, 2025
Authors: Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
cs.AI
Abstract
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to
evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA
includes 4,460 questions spanning 17 specialties and 11 body systems. It
comprises two subsets, Text for text evaluation and MM for multimodal
evaluation. Notably, MM introduces expert-level exam questions with diverse
images and rich clinical information, including patient records and examination
results, setting it apart from traditional medical multimodal benchmarks with
simple QA pairs generated from image captions. MedXpertQA applies rigorous
filtering and augmentation to address the insufficient difficulty of existing
benchmarks like MedQA, and incorporates specialty board questions to improve
clinical relevance and comprehensiveness. We perform data synthesis to mitigate
data leakage risk and conduct multiple rounds of expert reviews to ensure
accuracy and reliability. We evaluate 16 leading models on MedXpertQA.
Moreover, medicine is deeply connected to real-world decision-making, providing
a rich and representative setting for assessing reasoning abilities beyond
mathematics and code. To this end, we develop a reasoning-oriented subset to
facilitate the assessment of o1-like models.
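For readers who want to score a model on a multiple-choice set such as MedXpertQA's Text subset, the sketch below shows one way a minimal accuracy harness could look. It is illustrative only: the JSONL layout, the field names "question", "options", and "label", and the model_answer() hook are assumptions, not the benchmark's official schema or evaluation code.

import json

def model_answer(question, options):
    # Placeholder for an actual model call; should return one option key, e.g. "A".
    raise NotImplementedError

def accuracy(path):
    # Assumes one JSON object per line with "question", "options", and "label" fields.
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            pred = model_answer(item["question"], item["options"])
            correct += pred == item["label"]  # bool counts as 0/1
            total += 1
    return correct / total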