M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

Abstract

The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal content such as text, figures, and tables, which is very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective, automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing work, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning of open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% in the correctness of model responses compared to the baseline open-source models. Our data, code, and models are available at https://multimodal-documents.github.io.
