M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

November 9, 2024
Authors: Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing
cs.AI

Abstract

The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal content such as text, figures, and tables, which is very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective, automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing work, our benchmark consists of more recent and lengthier documents spanning hundreds of pages, and it requires open-ended solutions rather than only extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning of open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% in the correctness of model responses compared to the baseline open-source models. Our data, code, and models are available at https://multimodal-documents.github.io.
