AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

October 26, 2024
Authors: Lei Li, Xiangxu Zhang, Xiao Zhou, Zheng Liu
cs.AI

Abstract

Medical information retrieval (MIR) is essential for retrieving relevant medical knowledge from diverse sources, including electronic health records, scientific literature, and medical databases. However, achieving effective zero-shot dense retrieval in the medical domain poses substantial challenges due to the lack of relevance-labeled data. In this paper, we introduce a novel approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to tackle this issue. SL-HyDE leverages large language models (LLMs) as generators to generate hypothetical documents based on a given query. These generated documents encapsulate key medical context, guiding a dense retriever in identifying the most relevant documents. The self-learning framework progressively refines both pseudo-document generation and retrieval, utilizing unlabeled medical corpora without requiring any relevance-labeled data. Additionally, we present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios, encompassing five tasks and ten datasets. By benchmarking ten models on CMIRB, we establish a rigorous standard for evaluating medical information retrieval systems. Experimental results demonstrate that SL-HyDE significantly surpasses existing methods in retrieval accuracy while showcasing strong generalization and scalability across various LLM and retriever configurations. CMIRB data and evaluation code are publicly available at: https://github.com/CMIRB-benchmark/CMIRB.
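
The retrieval step described in the abstract (generate a hypothetical document from the query, embed it, and use that embedding to rank the corpus) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `stub_generator` is a hypothetical placeholder for an LLM call, `embed` is a toy bag-of-words encoder standing in for a dense retriever, and the two-document corpus is invented. The self-learning stage that refines the generator and retriever is not shown.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: bag-of-words term counts (assumption)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    corpus: Dict[str, str],
    top_k: int = 1,
) -> List[str]:
    """Rank corpus documents against a generated hypothetical document."""
    hypothetical = generate(query)      # pseudo-document written for the query
    h_vec = embed(hypothetical)         # embed the pseudo-document, not the query
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine(h_vec, embed(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:top_k]


def stub_generator(query: str) -> str:
    """Hypothetical placeholder for an LLM; returns a fixed pseudo-document."""
    return "metformin lowers blood glucose in type 2 diabetes"


# Invented two-document corpus, for illustration only.
CORPUS = {
    "doc_diabetes": "metformin is a first line drug that lowers blood glucose in type 2 diabetes",
    "doc_asthma": "inhaled corticosteroids reduce airway inflammation in asthma",
}

QUERY = "what medicine helps with high sugar"

if __name__ == "__main__":
    print(hyde_retrieve(QUERY, stub_generator, CORPUS))
```

In this toy setup the raw query shares no terms with either document, so its similarity to both is zero, while the hypothetical document supplies the medical vocabulary that lets the encoder match the diabetes document. This mirrors, in a very reduced form, the role the abstract assigns to generated documents in guiding the retriever.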
