ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
November 10, 2024
Authors: Canyu Chen, Jian Yu, Shan Chen, Che Liu, Zhongwei Wan, Danielle Bitterman, Fei Wang, Kai Shu
cs.AI
Abstract
Large Language Models (LLMs) hold great promise to revolutionize current
clinical systems, given their superior capabilities in medical text processing tasks
and medical licensing exams. Meanwhile, traditional ML models such as SVM and
XGBoost are still the ones mainly adopted for clinical prediction tasks. An
emerging question is: can LLMs beat traditional ML models in clinical
prediction? To answer it, we build a new benchmark, ClinicalBench, to comprehensively
study the clinical predictive modeling capacities of both general-purpose and
medical LLMs, and compare them with traditional ML models. ClinicalBench
covers three common clinical prediction tasks, two databases, 14
general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through
extensive empirical investigation, we find that both general-purpose and
medical LLMs, even across different model scales and diverse prompting or
fine-tuning strategies, cannot yet beat traditional ML models in clinical
prediction, which points to a potential deficiency in clinical
reasoning and decision-making. We call for caution when practitioners adopt
LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap
between LLMs' development for healthcare and real-world clinical practice.