MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
March 10, 2025
Authors: Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein
cs.AI
Abstract
Large Language Models (LLMs) have shown impressive performance on existing
medical question-answering benchmarks. This high performance makes it
increasingly difficult to meaningfully evaluate and differentiate advanced
methods. We present MedAgentsBench, a benchmark that focuses on challenging
medical questions requiring multi-step clinical reasoning, diagnosis
formulation, and treatment planning: scenarios where current models still
struggle despite their strong performance on standard tests. Drawing from seven
established medical datasets, our benchmark addresses three key limitations in
existing evaluations: (1) the prevalence of straightforward questions where
even base models achieve high performance, (2) inconsistent sampling and
evaluation protocols across studies, and (3) lack of systematic analysis of the
interplay between performance, cost, and inference time. Through experiments
with various base models and reasoning methods, we demonstrate that the latest
thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in
complex medical reasoning tasks. Additionally, advanced search-based agent
methods offer promising performance-to-cost ratios compared to traditional
approaches. Our analysis reveals substantial performance gaps between model
families on complex questions and identifies optimal model selections for
different computational constraints. Our benchmark and evaluation framework are
publicly available at https://github.com/gersteinlab/medagents-benchmark.
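To make the "performance-to-cost ratio" comparison mentioned in the abstract concrete, here is a minimal, illustrative sketch of ranking methods by accuracy gained per dollar. The model names, accuracies, and per-question costs are placeholders I made up for illustration, not numbers from the paper or the released benchmark code.

```python
# Illustrative sketch only: rank methods by a simple accuracy-per-dollar
# metric. All figures below are hypothetical, not reported results.
from dataclasses import dataclass


@dataclass
class ModelRun:
    name: str
    accuracy: float           # fraction correct on the hard-question subset
    cost_per_question: float  # USD spent per question (API + tool calls)


def performance_to_cost(run: ModelRun) -> float:
    """Accuracy obtained per dollar spent on a single question."""
    return run.accuracy / run.cost_per_question


runs = [
    ModelRun("baseline-chat-model", accuracy=0.42, cost_per_question=0.002),
    ModelRun("thinking-model", accuracy=0.61, cost_per_question=0.015),
    ModelRun("search-based-agent", accuracy=0.55, cost_per_question=0.006),
]

# Print methods from best to worst accuracy-per-dollar trade-off.
for run in sorted(runs, key=performance_to_cost, reverse=True):
    print(f"{run.name}: acc={run.accuracy:.2f}, "
          f"cost=${run.cost_per_question:.3f}, "
          f"acc/$={performance_to_cost(run):.1f}")
```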