

LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding

November 2, 2024
作者: Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A. Rossi, Changyou Chen, Tong Sun
cs.AI

Abstract

Large multimodal models (LMMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents. Traditional methods using document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to LMMs leads to inefficiencies, especially with lengthy documents. In this work, we present a novel framework named LoRA-Contextualizing Adaptation of Large multimodal models (LoCAL), which broadens the capabilities of any LMM to support long-document understanding. We demonstrate that LMMs can effectively serve as multimodal retrievers, fetching relevant pages to answer user questions based on these pages. LoCAL is implemented with two specific LMM adapters: one for evidence page retrieval and another for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of LoCAL.
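To make the two-stage design described above concrete, the following is a minimal sketch of a retrieve-then-answer pipeline in the spirit of LoCAL. The functions `embed_question`, `embed_page`, and `answer_from_pages` are hypothetical placeholders standing in for the paper's two LMM adapters (evidence page retrieval and question answering); they are simplified so the control flow runs end to end and do not reflect the authors' actual interfaces.

```python
# Hypothetical sketch: retrieve the most relevant pages of a long document,
# then answer the question from those pages only (LoCAL-style two-stage flow).

import numpy as np

def embed_question(question: str) -> np.ndarray:
    # Placeholder: a real system would use the retrieval-adapted LMM here.
    rng = np.random.default_rng(abs(hash(question)) % (2**32))
    return rng.standard_normal(128)

def embed_page(page_image_path: str) -> np.ndarray:
    # Placeholder: a real system would encode the page image with the same adapter.
    rng = np.random.default_rng(abs(hash(page_image_path)) % (2**32))
    return rng.standard_normal(128)

def retrieve_top_k(question: str, page_paths: list[str], k: int = 3) -> list[str]:
    """Score every page against the question and keep the k most similar pages."""
    q = embed_question(question)
    scored = []
    for path in page_paths:
        p = embed_page(path)
        similarity = float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p)))
        scored.append((similarity, path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:k]]

def answer_from_pages(question: str, evidence_pages: list[str]) -> str:
    # Placeholder: the QA-adapted LMM would consume only the retrieved pages,
    # avoiding the cost of feeding every page of a long document to the model.
    return f"Answer to {question!r} based on pages {evidence_pages}"

if __name__ == "__main__":
    pages = [f"doc/page_{i:03d}.png" for i in range(1, 51)]  # a 50-page document
    question = "What is the total revenue reported in 2023?"
    evidence = retrieve_top_k(question, pages, k=3)
    print(answer_from_pages(question, evidence))
```

The point of the sketch is the division of labor: a retrieval adapter narrows a long document to a handful of evidence pages, and a separate question-answering adapter reasons only over that small set.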

