MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
March 10, 2025
Authors: Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein
cs.AI
Abstract
Large Language Models (LLMs) have shown impressive performance on existing
medical question-answering benchmarks. This high performance makes it
increasingly difficult to meaningfully evaluate and differentiate advanced
methods. We present MedAgentsBench, a benchmark that focuses on challenging
medical questions requiring multi-step clinical reasoning, diagnosis
formulation, and treatment planning: scenarios where current models still
struggle despite their strong performance on standard tests. Drawing from seven
established medical datasets, our benchmark addresses three key limitations in
existing evaluations: (1) the prevalence of straightforward questions where
even base models achieve high performance, (2) inconsistent sampling and
evaluation protocols across studies, and (3) lack of systematic analysis of the
interplay between performance, cost, and inference time. Through experiments
with various base models and reasoning methods, we demonstrate that the latest
thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in
complex medical reasoning tasks. Additionally, advanced search-based agent
methods offer promising performance-to-cost ratios compared to traditional
approaches. Our analysis reveals substantial performance gaps between model
families on complex questions and identifies optimal model selections for
different computational constraints. Our benchmark and evaluation framework are
publicly available at https://github.com/gersteinlab/medagents-benchmark.
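To make the "performance-to-cost ratio" comparison mentioned in the abstract concrete, here is a minimal, illustrative sketch of ranking methods by accuracy gained per dollar. The model names, accuracies, and per-question costs are placeholders I made up for illustration, not numbers from the paper or the released benchmark code.

```python
# Illustrative sketch only: rank methods by a simple accuracy-per-dollar
# metric. All figures below are hypothetical, not reported results.
from dataclasses import dataclass


@dataclass
class ModelRun:
    name: str
    accuracy: float           # fraction correct on the hard-question subset
    cost_per_question: float  # USD spent per question (API + tool calls)


def performance_to_cost(run: ModelRun) -> float:
    """Accuracy obtained per dollar spent on a single question."""
    return run.accuracy / run.cost_per_question


runs = [
    ModelRun("baseline-chat-model", accuracy=0.42, cost_per_question=0.002),
    ModelRun("thinking-model", accuracy=0.61, cost_per_question=0.015),
    ModelRun("search-based-agent", accuracy=0.55, cost_per_question=0.006),
]

# Print methods from best to worst accuracy-per-dollar trade-off.
for run in sorted(runs, key=performance_to_cost, reverse=True):
    print(f"{run.name}: acc={run.accuracy:.2f}, "
          f"cost=${run.cost_per_question:.3f}, "
          f"acc/$={performance_to_cost(run):.1f}")
```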