Medical large language models are easily distracted
April 1, 2025
Authors: Krithik Vishwanath, Anton Alyakin, Daniel Alexander Alber, Jin Vivian Lee, Douglas Kondziolka, Eric Karl Oermann
cs.AI
Abstract
Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies like ambient dictation, which automatically generates draft notes from live patient encounters, may introduce additional noise, making it crucial to assess the ability of LLMs to filter relevant data. To investigate this, we developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context, or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions to improve model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not change this effect and in some cases introduced their own confounders, further degrading performance. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.
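As a rough illustration of the benchmark's idea, the following minimal Python sketch shows how a distracting statement might be embedded in a USMLE-style question and how the resulting accuracy drop could be measured. The helper names (`add_distractor`, `format_prompt`, `accuracy`, `ask_model`) and the example item are hypothetical, not the authors' actual code or data.

```python
# Hypothetical sketch of the MedDistractQA setup: take a USMLE-style
# multiple-choice question, embed an irrelevant but clinically-flavored
# "distractor" sentence, and compare model accuracy on the clean vs.
# distracted versions.

def add_distractor(stem, distractor):
    """Embed a simulated real-world distraction in the question stem."""
    return f"{stem} {distractor}"

def format_prompt(stem, choices):
    """Render a multiple-choice prompt; the model answers with a single letter."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    return f"{stem}\n{options}\nAnswer with the letter of the best option."

def accuracy(predicted, gold):
    """Fraction of items answered correctly."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Example item with a polysemous, non-clinical distractor ("pressure" here
# refers to stress at work, not blood pressure).
stem = ("A 58-year-old man presents with crushing substernal chest pain "
        "radiating to the left arm; ECG shows ST elevations in leads II, III, "
        "and aVF. What is the most likely diagnosis?")
distractor = "His daughter mentions he has been under a lot of pressure at work lately."
choices = {
    "A": "Inferior myocardial infarction",
    "B": "Acute pericarditis",
    "C": "Pulmonary embolism",
    "D": "Aortic dissection",
}

clean_prompt = format_prompt(stem, choices)
distracted_prompt = format_prompt(add_distractor(stem, distractor), choices)

# A benchmark run would send both prompts to the LLM under test (e.g. via a
# hypothetical ask_model(prompt) -> "A"/"B"/"C"/"D") over the whole question
# set and report accuracy(clean) - accuracy(distracted) as the degradation.
print(distracted_prompt)
```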