M-LongDoc: A Benchmark for Multimodal Super-Long Document Understanding and a Retrieval-Aware Tuning Framework

Abstract

The ability to understand and answer questions over documents is useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal content, such as text, figures, and tables, which is very time-consuming for humans to read thoroughly. Hence, there is an urgent need for effective, automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, together with an automated framework for evaluating the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing work, our benchmark consists of more recent and longer documents, spanning hundreds of pages, and it requires open-ended solutions rather than just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning of open-source models, we construct a training corpus for the question-answering task over such documents in a fully automatic manner. Experiments show that our tuning approach achieves a relative improvement of 4.6% in the correctness of model responses compared to the baseline open-source models. Our data, code, and models are available at https://multimodal-documents.github.io.
