MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
January 15, 2025
Authors: Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu
cs.AI
Abstract
Multi-modal document retrieval is designed to identify and retrieve various
forms of multi-modal content, such as figures, tables, charts, and layout
information from extensive documents. Despite its significance, there is a
notable lack of a robust benchmark to effectively evaluate the performance of
systems in multi-modal document retrieval. To address this gap, this work
introduces a new benchmark, named MMDocIR, encompassing two distinct tasks:
page-level and layout-level retrieval. The former focuses on localizing the
most relevant pages within a long document, while the latter targets the
detection of specific layouts, offering finer granularity than
whole-page analysis. A layout can refer to a variety of elements such as
textual paragraphs, equations, figures, tables, or charts. The MMDocIR
benchmark comprises a rich dataset featuring expertly annotated labels for
1,685 questions and bootstrapped labels for 173,843 questions, making it a
pivotal resource for advancing multi-modal document retrieval for both training
and evaluation. Through rigorous experiments, we reveal that (i) visual
retrievers significantly outperform their text counterparts, (ii) the MMDocIR
training set can effectively benefit the training of multi-modal document
retrieval, and (iii) text retrievers leveraging VLM-text perform much better
than those using OCR-text. These findings underscore the potential advantages
of integrating visual elements for multi-modal document retrieval.
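
To make the page-level task concrete, here is a minimal sketch of how a retriever's output for one question might be scored. It assumes the retriever assigns one relevance score per page and that quality is measured with recall@k; the abstract does not name a metric, so recall@k, the function names, and the toy scores are illustrative assumptions rather than the benchmark's official evaluation code.

```python
from typing import Sequence, Set


def top_k_pages(scores: Sequence[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring pages, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def recall_at_k(scores: Sequence[float], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant pages that appear among the top-k retrieved pages."""
    if not relevant:
        raise ValueError("at least one relevant page is required")
    retrieved = set(top_k_pages(scores, k))
    return len(retrieved & relevant) / len(relevant)


# Hypothetical example: a 5-page document, one question, two gold pages.
page_scores = [0.10, 0.80, 0.30, 0.75, 0.05]  # assumed retriever scores per page
gold_pages = {1, 2}                            # assumed annotated evidence pages

print(recall_at_k(page_scores, gold_pages, k=2))
print(recall_at_k(page_scores, gold_pages, k=3))
```

The same scoring would apply to layout-level retrieval by replacing page indices with layout-element indices (paragraphs, equations, figures, tables, or charts).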