MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

January 15, 2025
Authors: Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu
cs.AI

Abstract

Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information, from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of multi-modal document retrieval systems. To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering finer granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set can effectively benefit the training process of multi-modal document retrieval, and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.
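
To make the page-level task concrete, the following is a minimal sketch, not the authors' implementation. It assumes each page and each question has already been embedded as a vector by some retriever. The function names (`rank_pages`, `recall_at_k`), the toy vectors, and the choice of cosine similarity and recall@k are illustrative assumptions that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_pages(query_vec, page_vecs):
    """Return page indices sorted from most to least similar to the query.

    Ties are broken by the lower page index so the ranking is deterministic.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(page_vecs)]
    return [idx for _, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant pages that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


# Hypothetical toy data: one question, a four-page document, pages 0 and 1 relevant.
query = [1.0, 0.0]
pages = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [-1.0, 0.0]]
ranking = rank_pages(query, pages)
top1 = recall_at_k(ranking, {0, 1}, k=1)
top3 = recall_at_k(ranking, {0, 1}, k=3)
```

Under the same assumptions, layout-level retrieval could reuse this ranking step by embedding individual layout elements (paragraphs, equations, figures, tables, charts) instead of whole pages.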
