MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
March 10, 2025
Authors: Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein
cs.AI
Abstract
Large Language Models (LLMs) have shown impressive performance on existing
medical question-answering benchmarks. This high performance makes it
increasingly difficult to meaningfully evaluate and differentiate advanced
methods. We present MedAgentsBench, a benchmark that focuses on challenging
medical questions requiring multi-step clinical reasoning, diagnosis
formulation, and treatment planning, scenarios where current models still
struggle despite their strong performance on standard tests. Drawing from seven
established medical datasets, our benchmark addresses three key limitations in
existing evaluations: (1) the prevalence of straightforward questions where
even base models achieve high performance, (2) inconsistent sampling and
evaluation protocols across studies, and (3) lack of systematic analysis of the
interplay between performance, cost, and inference time. Through experiments
with various base models and reasoning methods, we demonstrate that the latest
thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in
complex medical reasoning tasks. Additionally, advanced search-based agent
methods offer promising performance-to-cost ratios compared to traditional
approaches. Our analysis reveals substantial performance gaps between model
families on complex questions and identifies optimal model selections for
different computational constraints. Our benchmark and evaluation framework are
publicly available at https://github.com/gersteinlab/medagents-benchmark.
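The two selection criteria the abstract emphasizes, retaining only questions that base models still get wrong and weighing accuracy against inference cost, can be illustrated with a small sketch. The snippet below is purely illustrative and assumes a hypothetical question schema (an `accuracy_by_base_model` field, a 0.5 difficulty threshold, and example model names); it is not the repository's actual data format or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Question:
    """One benchmark item with per-model accuracy from reference base models."""
    qid: str
    dataset: str                               # e.g. one of the seven source datasets
    accuracy_by_base_model: Dict[str, float]   # fraction correct per base model


def select_hard_questions(questions: List[Question],
                          max_base_accuracy: float = 0.5) -> List[Question]:
    """Keep only questions that remain hard: every reference base model scores
    at or below the threshold (a hypothetical filtering criterion)."""
    return [
        q for q in questions
        if max(q.accuracy_by_base_model.values()) <= max_base_accuracy
    ]


def performance_to_cost(accuracy: float, usd_per_question: float) -> float:
    """Accuracy obtained per dollar spent: a simple efficiency score for
    comparing reasoning methods under a cost budget."""
    return accuracy / usd_per_question if usd_per_question > 0 else float("inf")


if __name__ == "__main__":
    pool = [
        Question("q1", "MedQA", {"base-model-a": 0.9, "base-model-b": 0.8}),
        Question("q2", "PubMedQA", {"base-model-a": 0.4, "base-model-b": 0.3}),
    ]
    hard = select_hard_questions(pool)
    print([q.qid for q in hard])                                   # -> ['q2']
    print(performance_to_cost(accuracy=0.62, usd_per_question=0.015))
```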