ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Abstract

Large Language Models (LLMs) hold great promise for revolutionizing current clinical systems, given their strong performance on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost are still the models mainly adopted for clinical prediction tasks. This raises an emerging question: can LLMs beat traditional ML models in clinical prediction? To answer it, we build a new benchmark, ClinicalBench, to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs and to compare them with traditional ML models. ClinicalBench covers three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we find that both general-purpose and medical LLMs, even across different model scales and with diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction. This finding sheds light on a potential deficiency of LLMs in clinical reasoning and decision-making, and we call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be used to bridge the gap between the development of LLMs for healthcare and real-world clinical practice.
