ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

February 25, 2025
Authors: Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao
cs.AI

Abstract

Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.
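
The abstract does not say how the GMM enters the hybrid retrieval step, so the following is only a minimal sketch under one plausible reading: for each retriever, fit a two-component 1-D Gaussian mixture to the similarity scores, keep the pages that fall in the higher-scoring component (a per-query cutoff instead of a fixed top-k), and merge the textual and visual results. Every function name, the page ids, and the scores are hypothetical and are not taken from the paper or its code.

```python
import math

def high_cluster_posteriors(scores, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM and return, for each
    score, the posterior probability of the higher-mean component."""
    lo, hi = min(scores), max(scores)
    if hi - lo < 1e-12:               # all scores equal: nothing to separate
        return [1.0] * len(scores)
    mu = [lo, hi]                     # start the two means at the extremes
    var = [(hi - lo) ** 2 / 4] * 2
    w = [0.5, 0.5]
    post = [0.5] * len(scores)
    for _ in range(iters):
        # E-step in log space to avoid underflow for well-separated clusters
        post = []
        for s in scores:
            lp = [math.log(w[k]) - 0.5 * math.log(var[k])
                  - (s - mu[k]) ** 2 / (2 * var[k]) for k in range(2)]
            d = lp[0] - lp[1]
            post.append(0.0 if d > 700 else 1.0 if d < -700
                        else 1.0 / (1.0 + math.exp(d)))
        # M-step: re-estimate weight, mean, and variance of each component
        for k in range(2):
            r = [p if k == 1 else 1.0 - p for p in post]
            n_k = sum(r)
            if n_k < 1e-9:
                continue
            mu[k] = sum(ri * s for ri, s in zip(r, scores)) / n_k
            var[k] = max(sum(ri * (s - mu[k]) ** 2
                             for ri, s in zip(r, scores)) / n_k, 1e-6)
            w[k] = n_k / len(scores)
    return post

def dynamic_top_k(scored_pages):
    """Keep the pages assigned to the high-score component (assumed rule)."""
    pages = sorted(scored_pages, key=scored_pages.get, reverse=True)
    post = high_cluster_posteriors([scored_pages[p] for p in pages])
    return [p for p, q in zip(pages, post) if q > 0.5]

def hybrid_retrieve(text_scores, visual_scores):
    """Union of the per-modality selections, textual results first."""
    merged = []
    for page in dynamic_top_k(text_scores) + dynamic_top_k(visual_scores):
        if page not in merged:
            merged.append(page)
    return merged

# Hypothetical similarity scores from a textual and a visual retriever.
text_scores = {"p1": 0.91, "p2": 0.88, "p3": 0.30, "p4": 0.27, "p5": 0.25}
visual_scores = {"p2": 0.83, "p6": 0.80, "p1": 0.35,
                 "p3": 0.33, "p4": 0.31, "p5": 0.30}
print(hybrid_retrieve(text_scores, visual_scores))
```

With a mixture-based cutoff, the number of retrieved pages adapts to how the scores are distributed for each query, which is the property a fixed top-k lacks. Whether ViDoRAG applies the GMM in exactly this way is not stated in the abstract.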

March 3, 2025