Clinical knowledge in LLMs does not translate to human interactions
April 26, 2025
Authors: Andrew M. Bean, Rebecca Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera, Sara Hincapié Monsalve, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, Adam Mahdi
cs.AI
Abstract
Global healthcare providers are exploring use of large language models (LLMs)
to provide medical advice to the public. LLMs now achieve nearly perfect scores
on medical licensing exams, but this does not necessarily translate to accurate
performance in real-world settings. We tested if LLMs can assist members of the
public in identifying underlying conditions and choosing a course of action
(disposition) in ten medical scenarios in a controlled study with 1,298
participants. Participants were randomly assigned to receive assistance from an
LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested
alone, LLMs complete the scenarios accurately, correctly identifying conditions
in 94.9% of cases and disposition in 56.3% on average. However, participants
using the same LLMs identified relevant conditions in less than 34.5% of cases
and disposition in less than 44.2%, both no better than the control group. We
identify user interactions as a challenge to the deployment of LLMs for medical
advice. Standard benchmarks for medical knowledge and simulated patient
interactions do not predict the failures we find with human participants.
Moving forward, we recommend systematic human user testing to evaluate
interactive capabilities prior to public deployments in healthcare.