AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels
October 26, 2024
Authors: Lei Li, Xiangxu Zhang, Xiao Zhou, Zheng Liu
cs.AI
Abstract
Medical information retrieval (MIR) is essential for retrieving relevant
medical knowledge from diverse sources, including electronic health records,
scientific literature, and medical databases. However, achieving effective
zero-shot dense retrieval in the medical domain poses substantial challenges
due to the lack of relevance-labeled data. In this paper, we introduce a novel
approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to
tackle this issue. SL-HyDE leverages large language models (LLMs) as generators
to generate hypothetical documents based on a given query. These generated
documents encapsulate key medical context, guiding a dense retriever in
identifying the most relevant documents. The self-learning framework
progressively refines both pseudo-document generation and retrieval, utilizing
unlabeled medical corpora without requiring any relevance-labeled data.
Additionally, we present the Chinese Medical Information Retrieval Benchmark
(CMIRB), a comprehensive evaluation framework grounded in real-world medical
scenarios, encompassing five tasks and ten datasets. By benchmarking ten models
on CMIRB, we establish a rigorous standard for evaluating medical information
retrieval systems. Experimental results demonstrate that SL-HyDE significantly
surpasses existing methods in retrieval accuracy while showcasing strong
generalization and scalability across various LLM and retriever configurations.
CMIRB data and evaluation code are publicly available at:
https://github.com/CMIRB-benchmark/CMIRB.
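The hypothetical-document retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `llm` callable stands in for the paper's generator, and the bag-of-words `embed` function is a toy stand-in for a trained dense retriever; all function names here are assumptions for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" standing in for a dense encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def generate_hypothetical_doc(query, llm):
    # In SL-HyDE, an LLM drafts a pseudo-document that encapsulates
    # relevant medical context; `llm` is any prompt -> text callable.
    return llm(f"Write a passage answering: {query}")

def hyde_retrieve(query, corpus, llm, top_k=3):
    # Key idea: embed the generated hypothetical document instead of
    # the raw query, then rank corpus documents by similarity to it.
    hypo = generate_hypothetical_doc(query, llm)
    q_vec = embed(hypo)
    ranked = sorted(corpus, key=lambda d: cosine(q_vec, embed(d)),
                    reverse=True)
    return ranked[:top_k]
```

The self-learning loop of SL-HyDE (iteratively refining both the generator and the retriever on unlabeled corpora) is not shown here; this sketch covers only the inference-time retrieval path.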