

Language Models And A Second Opinion Use Case: The Pocket Professional

October 27, 2024
Author: David Noever
cs.AI

Abstract

This research tests the role of Large Language Models (LLMs) as formal second-opinion tools in professional decision-making, focusing in particular on complex medical cases where even experienced physicians seek peer consultation. The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs' performance against crowd-sourced physician responses. A key finding was the high overall score achievable by the latest foundation models (>80% accuracy relative to the consensus opinion), which exceeds most human metrics reported on the same clinical cases (450 pages of patient profiles and test results). The study quantifies the performance gap between straightforward cases (>81% accuracy) and complex scenarios (43% accuracy), particularly in those cases that generated substantial debate among human physicians. The research suggests that LLMs may be valuable as generators of comprehensive differential diagnoses rather than as primary diagnostic tools, potentially helping to counter cognitive biases in clinical decision-making, reduce cognitive load, and thereby remove some sources of medical error. The inclusion of a second, comparative legal dataset (Supreme Court cases, N=21) provides additional empirical context for using AI to foster second opinions, though these legal challenges proved considerably easier for LLMs to analyze. Beyond its original empirical evidence of LLM accuracy, the research aggregates a novel benchmark that others can use to score the reliability of highly contested questions and answers across both LLMs and disagreeing human practitioners. These results suggest that the optimal deployment of LLMs in professional settings may differ substantially from current approaches that emphasize automation of routine tasks.

