Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease

February 20, 2025
Authors: Elliot Schumacher, Dhruv Naik, Anitha Kannan
cs.AI

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in disease diagnosis. However, their effectiveness in identifying rarer diseases, which are inherently more challenging to diagnose, remains an open question. Rare disease performance is critical given the increasing use of LLMs in healthcare settings. This is especially true if a primary care physician needs to make a rarer prognosis from only a patient conversation so that they can take the appropriate next step. To that end, several clinical decision support systems are designed to support providers in rare disease identification. Yet their utility is limited by their lack of knowledge of common disorders and their difficulty of use. In this paper, we propose RareScale to combine the knowledge of LLMs with expert systems. We jointly use an expert system and an LLM to simulate rare disease chats. This data is used to train a rare disease candidate predictor model. Candidates from this smaller model are then used as additional inputs to a black-box LLM to make the final differential diagnosis. Thus, RareScale allows for a balance between rare and common diagnoses. We present results on over 575 rare diseases, beginning with Abdominal Actinomycosis and ending with Wilson's Disease. Our approach significantly improves the baseline performance of black-box LLMs by over 17% in Top-5 accuracy. We also find that our candidate generation performance is high (e.g., 88.8% on gpt-4o-generated chats).
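
The abstract describes two mechanics that a short example may make more concrete: passing rare-disease candidates to a black-box LLM as extra input, and scoring a differential diagnosis by Top-5 accuracy. The Python sketch below is only an illustration under stated assumptions. The function names `build_ddx_prompt` and `top_k_accuracy`, the prompt wording, and the case-insensitive string matching are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def build_ddx_prompt(chat: str, rare_candidates: Sequence[str]) -> str:
    """Attach rare-disease candidates to a patient chat as additional LLM input.

    Assumption: the prompt wording here is illustrative, not the paper's.
    """
    candidate_lines = "\n".join(f"- {name}" for name in rare_candidates)
    return (
        "Patient conversation:\n"
        f"{chat}\n\n"
        "Rare diseases to consider (from a separate candidate model):\n"
        f"{candidate_lines}\n\n"
        "List the five most likely diagnoses, common or rare."
    )


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    gold_labels: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of cases whose gold diagnosis is among the first k predictions.

    Assumption: diagnoses are compared as case-insensitive strings; a real
    evaluation would likely need a more careful disease-name matching step.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(
        gold.lower() in {p.lower() for p in preds[:k]}
        for preds, gold in zip(ranked_predictions, gold_labels)
    )
    return hits / len(gold_labels)
```

In this sketch the candidate list is advisory: the prompt asks for the most likely diagnoses "common or rare", which mirrors the stated goal of balancing rare and common diagnoses rather than forcing a rare answer.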
