Clinical knowledge in LLMs does not translate to human interactions
April 26, 2025
Authors: Andrew M. Bean, Rebecca Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera, Sara Hincapié Monsalve, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, Adam Mahdi
cs.AI
Abstract
Global healthcare providers are exploring use of large language models (LLMs)
to provide medical advice to the public. LLMs now achieve nearly perfect scores
on medical licensing exams, but this does not necessarily translate to accurate
performance in real-world settings. We tested if LLMs can assist members of the
public in identifying underlying conditions and choosing a course of action
(disposition) in ten medical scenarios in a controlled study with 1,298
participants. Participants were randomly assigned to receive assistance from an
LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested
alone, LLMs complete the scenarios accurately, correctly identifying conditions
in 94.9% of cases and disposition in 56.3% on average. However, participants
using the same LLMs identified relevant conditions in less than 34.5% of cases
and disposition in less than 44.2%, both no better than the control group. We
identify user interactions as a challenge to the deployment of LLMs for medical
advice. Standard benchmarks for medical knowledge and simulated patient
interactions do not predict the failures we find with human participants.
Moving forward, we recommend systematic human user testing to evaluate
interactive capabilities prior to public deployments in healthcare.