VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
December 14, 2024
Authors: Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesha Manocha
cs.AI
Abstract
Understanding information from a collection of multiple documents,
particularly those with visually rich elements, is important for
document-grounded question answering. This paper introduces VisDoMBench, the
first comprehensive benchmark designed to evaluate QA systems in multi-document
settings with rich multimodal content, including tables, charts, and
presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval
Augmented Generation (RAG) approach that simultaneously utilizes visual and
textual RAG, combining robust visual retrieval capabilities with sophisticated
linguistic reasoning. VisDoMRAG employs a multi-step reasoning process
encompassing evidence curation and chain-of-thought reasoning for concurrent
textual and visual RAG pipelines. A key novelty of VisDoMRAG is its
consistency-constrained modality fusion mechanism, which aligns the reasoning
processes across modalities at inference time to produce a coherent final
answer. This leads to enhanced accuracy in scenarios where critical information
is distributed across modalities and improved answer verifiability through
implicit context attribution. Through extensive experiments involving
open-source and proprietary large language models, we benchmark
state-of-the-art document QA methods on VisDoMBench. Extensive results show
that VisDoMRAG outperforms unimodal and long-context LLM baselines for
end-to-end multimodal document QA by 12-20%.
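For a concrete picture of the control flow the abstract outlines, below is a minimal Python sketch. It is not the authors' code: `retrieve`, `call_llm`, and `PipelineOutput` are hypothetical placeholders, and the simple string-agreement check merely stands in for the paper's consistency-constrained modality fusion.

```python
# Illustrative sketch: run textual and visual RAG pipelines in parallel, each
# producing curated evidence, a chain-of-thought trace, and a candidate
# answer, then fuse the two under a simple consistency check. All names here
# are hypothetical placeholders, not the authors' API.

from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineOutput:
    evidence: list[str]  # curated evidence (text chunks or page references)
    reasoning: str       # chain-of-thought trace
    answer: str          # candidate answer from this modality


def run_pipeline(question: str,
                 retrieve: Callable[[str], list[str]],
                 call_llm: Callable[[str], str],
                 modality: str) -> PipelineOutput:
    """One modality-specific RAG pass: retrieve, curate evidence, reason."""
    # 1) Retrieve modality-specific context (text chunks or rendered pages).
    context = retrieve(question)
    # 2) Evidence curation: keep only the context needed to answer.
    evidence_prompt = (f"Question: {question}\nContext:\n"
                       + "\n".join(context)
                       + f"\nList only the {modality} evidence needed to answer.")
    evidence = call_llm(evidence_prompt).splitlines()
    # 3) Chain-of-thought answer over the curated evidence.
    cot_prompt = (f"Question: {question}\nEvidence:\n"
                  + "\n".join(evidence)
                  + "\nReason step by step; state the final answer on the last line.")
    reasoning = call_llm(cot_prompt)
    lines = reasoning.splitlines()
    answer = lines[-1] if lines else reasoning
    return PipelineOutput(evidence, reasoning, answer)


def consistency_fusion(question: str,
                       text_out: PipelineOutput,
                       visual_out: PipelineOutput,
                       call_llm: Callable[[str], str]) -> str:
    """Fuse the two modality answers at inference time."""
    # If the modalities already agree, accept the shared answer.
    if text_out.answer.strip().lower() == visual_out.answer.strip().lower():
        return text_out.answer
    # Otherwise ask the LLM to reconcile the two reasoning chains into one
    # answer consistent with both sets of evidence.
    fusion_prompt = (f"Question: {question}\n"
                     f"Textual reasoning:\n{text_out.reasoning}\n"
                     f"Visual reasoning:\n{visual_out.reasoning}\n"
                     "Resolve any disagreement and give one final answer "
                     "supported by both chains of evidence.")
    return call_llm(fusion_prompt)
```

Plugging in a real retriever for each branch (e.g., dense text retrieval for the textual pipeline and a page-image retriever for the visual one) and an LLM client turns the two `run_pipeline` calls plus `consistency_fusion` into the end-to-end flow described above.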