Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
November 14, 2024
Authors: Nghia Trung Ngo, Chien Van Nguyen, Franck Dernoncourt, Thien Huu Nguyen
cs.AI
Abstract
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those in the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets for testing LLMs' ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.
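As a rough illustration of the robustness setting the abstract describes, the sketch below assembles a RAG-style prompt that mixes supporting documents with distractor (noise) documents before querying a model. The function name, prompt format, and document labels are our own assumptions for illustration, not MedRGB's actual implementation.

```python
import random

def build_rag_prompt(question, options, gold_docs, noise_docs, n_noise=2, seed=0):
    """Assemble a multiple-choice RAG prompt whose context mixes
    supporting (gold) documents with sampled distractor documents,
    shuffled so the model cannot rely on document position.
    NOTE: this format is a hypothetical sketch, not the benchmark's own.
    """
    rng = random.Random(seed)
    docs = list(gold_docs) + rng.sample(noise_docs, n_noise)
    rng.shuffle(docs)
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    letters = "ABCD"
    choices = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Options:\n{choices}\n"
        "Answer with the option letter and a brief explanation."
    )
```

A robustness evaluation in this spirit would sweep `n_noise` (or swap in misinformation documents that contradict the gold ones) and measure how the model's answer accuracy degrades as the retrieved context gets noisier.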