Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering

November 14, 2024
Authors: Nghia Trung Ngo, Chien Van Nguyen, Franck Dernoncourt, Thien Huu Nguyen
cs.AI

Abstract

Retrieval-augmented generation (RAG) has emerged as a promising approach to enhancing the performance of large language models (LLMs) on knowledge-intensive tasks such as those in the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting across these situations, including sufficiency, integration, and robustness. We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets for testing LLMs' ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.
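The robustness scenario described above can be sketched in a few lines: mix gold evidence with distractor documents in the retrieval context, then measure how often the model's answer stays unchanged. The prompt template, noise-mixing procedure, and consistency metric below are illustrative assumptions for exposition, not MedRGB's actual implementation.

```python
import random

def build_rag_prompt(question, gold_docs, noise_docs, num_noise=2, seed=0):
    """Mix gold evidence with sampled distractor documents, shuffle them,
    and format a retrieve-then-answer prompt (hypothetical template)."""
    rng = random.Random(seed)
    docs = list(gold_docs) + rng.sample(noise_docs, k=num_noise)
    rng.shuffle(docs)
    context = "\n\n".join(f"Document {i+1}: {d}" for i, d in enumerate(docs))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def robustness_score(answers_clean, answers_noisy):
    """Fraction of questions whose answer is unchanged once noise is injected."""
    assert len(answers_clean) == len(answers_noisy)
    same = sum(a == b for a, b in zip(answers_clean, answers_noisy))
    return same / len(answers_clean)
```

A lower `robustness_score` indicates the model is more easily distracted by noisy or misleading retrieved documents, which is the failure mode the benchmark's robustness setting probes.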
