
Language Models And A Second Opinion Use Case: The Pocket Professional

October 27, 2024
作者: David Noever
cs.AI

Abstract

This research tests the role of Large Language Models (LLMs) as formal second opinion tools in professional decision-making, particularly focusing on complex medical cases where even experienced physicians seek peer consultation. The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs' performance against crowd-sourced physician responses. A key finding was the high overall score achievable by the latest foundational models (>80% accuracy compared to consensus opinion), which exceeds most human metrics reported on the same clinical cases (450 pages of patient profiles, test results). The study evaluates the disparity in LLM performance between straightforward cases (>81% accuracy) and complex scenarios (43% accuracy), particularly in the cases that generated substantial debate among human physicians. The research demonstrates that LLMs may be valuable as generators of comprehensive differential diagnoses rather than as primary diagnostic tools, potentially helping to counter cognitive biases in clinical decision-making, reduce cognitive loads, and thus remove some sources of medical error. The inclusion of a second comparative legal dataset (Supreme Court cases, N=21) provides added empirical context for the use of AI to foster second opinions, though these legal challenges proved considerably easier for LLMs to analyze. In addition to its original contribution of empirical evidence for LLM accuracy, the research aggregates a novel benchmark that others can use to score the reliability of highly contested questions and answers across both LLMs and disagreeing human practitioners. These results suggest that the optimal deployment of LLMs in professional settings may differ substantially from current approaches that emphasize automation of routine tasks.

