MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Abstract

Multi-modal document retrieval aims to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information, from extensive documents. Despite its significance, there is a notable lack of a robust benchmark for evaluating the performance of multi-modal document retrieval systems. To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering finer granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval in both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set can effectively benefit the training process of multi-modal document retrieval, and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.
