MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
February 20, 2025
Authors: Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding
cs.AI
Abstract
Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge is hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task: the best model achieves an F1 score of only 0.625 on the "hard" category of hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to the ground truth. Our experiments also show that incorporating domain-specific knowledge and introducing a "not sure" answer category improve precision and F1 scores by up to 38% relative to baselines.
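
To illustrate the bidirectional entailment clustering mentioned in the abstract, the sketch below groups candidate answers into semantic clusters whenever two answers entail each other in both directions. This is a minimal, hypothetical sketch, not the paper's implementation: the `entails` judge is a placeholder that, in practice, could be backed by an NLI model or an LLM prompted for a yes/no entailment decision.

```python
def entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical entailment judge; replace with an NLI model or an LLM call."""
    raise NotImplementedError


def bidirectional_entailment_clusters(answers: list[str]) -> list[list[str]]:
    """Group answers into clusters whose members mutually entail one another."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)  # mutual entailment: same semantic cluster
                break
        else:
            # no existing cluster is bidirectionally entailed: start a new one
            clusters.append([ans])
    return clusters
```

Under this scheme, a hallucinated answer that lands in (or near) the same cluster as the ground-truth answer is the kind of semantically close, "hard" case the abstract reports as difficult to detect.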
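The abstract also notes that supplying domain knowledge and allowing a "not sure" abstention improves detection. The sketch below shows one way such a detection prompt could be structured; the wording, field names, and output labels are illustrative assumptions, not the benchmark's actual prompt.

```python
# Illustrative detection prompt (an assumption, not the paper's exact prompt):
# the detector sees the question, retrieved domain knowledge, and a candidate
# answer, and may abstain with "not sure" instead of forcing a binary guess.
DETECTION_PROMPT = """You are a medical expert reviewing an answer for factual accuracy.

Question: {question}
Supporting knowledge: {knowledge}
Candidate answer: {answer}

Is the candidate answer faithful to the supporting knowledge, or does it contain
a hallucination? Reply with exactly one of: "faithful", "hallucinated", "not sure".
"""

# Example usage with placeholder inputs.
prompt = DETECTION_PROMPT.format(
    question="<PubMedQA-style question>",
    knowledge="<retrieved PubMed context>",
    answer="<candidate answer to check>",
)
```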