Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering

November 14, 2024
Authors: Nghia Trung Ngo, Chien Van Nguyen, Franck Dernoncourt, Thien Huu Nguyen
cs.AI

Abstract

Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those in the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets for testing LLMs' ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.
