VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
December 14, 2024
Authors: Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha
cs.AI
Abstract
Understanding information from a collection of multiple documents,
particularly those with visually rich elements, is important for
document-grounded question answering. This paper introduces VisDoMBench, the
first comprehensive benchmark designed to evaluate QA systems in multi-document
settings with rich multimodal content, including tables, charts, and
presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval
Augmented Generation (RAG) approach that simultaneously utilizes visual and
textual RAG, combining robust visual retrieval capabilities with sophisticated
linguistic reasoning. VisDoMRAG employs a multi-step reasoning process
encompassing evidence curation and chain-of-thought reasoning for concurrent
textual and visual RAG pipelines. A key novelty of VisDoMRAG is its
consistency-constrained modality fusion mechanism, which aligns the reasoning
processes across modalities at inference time to produce a coherent final
answer. This leads to enhanced accuracy in scenarios where critical information
is distributed across modalities and improved answer verifiability through
implicit context attribution. Through extensive experiments involving
open-source and proprietary large language models, we benchmark
state-of-the-art document QA methods on VisDoMBench. Results show
that VisDoMRAG outperforms unimodal and long-context LLM baselines for
end-to-end multimodal document QA by 12-20%.
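To make the described flow concrete, below is a minimal, self-contained sketch of the two concurrent RAG pipelines and the consistency-constrained fusion step, as the abstract outlines them. All names (retrieve, run_pipeline, visdom_rag), the prompts, and the keyword retriever are hypothetical stand-ins, and visual evidence is represented as text captions here; the paper's actual retrievers, vision inputs, and fusion prompts are not specified in this abstract.

```python
# Hypothetical sketch of the VisDoMRAG flow from the abstract; not the
# authors' implementation. `llm` is any caller-supplied text-in/text-out
# model function.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PipelineOutput:
    evidence: List[str]   # curated evidence for this modality
    reasoning: str        # chain-of-thought text produced by the LLM
    answer: str           # candidate answer from this pipeline


def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Stand-in retriever: rank corpus items by keyword overlap with the query."""
    terms = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(terms & set(d.lower().split())))[:k]


def run_pipeline(llm: Callable[[str], str], question: str,
                 evidence: List[str], modality: str) -> PipelineOutput:
    """One modality's RAG pipeline: evidence curation + chain-of-thought QA."""
    prompt = (f"[{modality} evidence]\n" + "\n".join(evidence)
              + f"\n\nQuestion: {question}\n"
              "Reason step by step, then put the answer on the last line.")
    output = llm(prompt)
    reasoning, _, answer = output.rpartition("\n")
    return PipelineOutput(evidence, reasoning, answer.strip())


def visdom_rag(llm: Callable[[str], str], question: str,
               text_corpus: List[str], visual_corpus: List[str]) -> str:
    """Run textual and visual RAG concurrently, then fuse: the final LLM call
    sees both reasoning chains and must reconcile them into one coherent
    answer, with the curated evidence kept available for attribution."""
    t = run_pipeline(llm, question, retrieve(question, text_corpus), "text")
    v = run_pipeline(llm, question, retrieve(question, visual_corpus), "visual")
    fusion = (f"Question: {question}\n"
              f"Text reasoning:\n{t.reasoning}\nText answer: {t.answer}\n"
              f"Visual reasoning:\n{v.reasoning}\nVisual answer: {v.answer}\n"
              "Check the two chains for consistency and give one final answer.")
    return llm(fusion)
```

In the real system the visual pipeline would pass rendered page images (tables, charts, slides) to a vision-language model rather than captions, but the control flow sketched here, two concurrent pipelines whose reasoning chains are aligned at inference time, is what the abstract credits for the accuracy and verifiability gains.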