Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
February 12, 2025
Authors: Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari
cs.AI
Abstract
Large Language Models (LLMs) struggle with hallucinations and outdated
knowledge due to their reliance on static training data. Retrieval-Augmented
Generation (RAG) mitigates these issues by integrating external dynamic
information, enhancing factual and up-to-date grounding. Recent advances in
multimodal learning have led to the development of Multimodal RAG,
incorporating multiple modalities such as text, images, audio, and video to
enhance the generated outputs. However, cross-modal alignment and reasoning
introduce unique challenges to Multimodal RAG, distinguishing it from
traditional unimodal RAG. This survey offers a structured and comprehensive
analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks,
evaluation, methodologies, and innovations in retrieval, fusion, augmentation,
and generation. We closely examine training strategies, robustness
enhancements, and loss functions, while also exploring diverse Multimodal
RAG scenarios. Furthermore, we discuss open challenges and future research
directions to support advancements in this evolving field. This survey lays the
foundation for developing more capable and reliable AI systems that effectively
leverage multimodal dynamic external knowledge bases. Resources are available
at https://github.com/llm-lab-org/Multimodal-RAG-Survey.