LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding

November 2, 2024
Authors: Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A. Rossi, Changyou Chen, Tong Sun
cs.AI

Abstract

Large multimodal models (LMMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents. Traditional methods using document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to LMMs leads to inefficiencies, especially with lengthy documents. In this work, we present a novel framework named LoRA-Contextualizing Adaptation of Large multimodal models (LoCAL), which broadens the capabilities of any LMM to support long-document understanding. We demonstrate that LMMs can effectively serve as multimodal retrievers, fetching relevant pages to answer user questions based on these pages. LoCAL is implemented with two specific LMM adapters: one for evidence page retrieval and another for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of LoCAL.
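
The two-stage design can be pictured as a retrieve-then-answer pipeline: a retrieval adapter scores every page of the document against the question, and a separate question-answering adapter reads only the top-ranked pages. Below is a minimal Python sketch of that flow under stated assumptions; embed_question, embed_page, and answer_with_qa_adapter are hypothetical stand-ins for the paper's LoRA-adapted LMM components, not the released LoCAL implementation.

```python
# Minimal sketch of a LoCAL-style retrieve-then-answer pipeline.
# The embedding and answering functions below are placeholders for the
# paper's two LoRA-adapted LMM adapters (retrieval and QA).

import torch
import torch.nn.functional as F


def embed_question(question: str) -> torch.Tensor:
    # Stand-in for the retrieval adapter's question embedding.
    torch.manual_seed(hash(question) % (2**31))
    return torch.randn(256)


def embed_page(page_image: str) -> torch.Tensor:
    # Stand-in for the retrieval adapter's page-image embedding.
    torch.manual_seed(hash(page_image) % (2**31))
    return torch.randn(256)


def retrieve_evidence_pages(question: str, pages: list[str], top_k: int = 3) -> list[str]:
    # Stage 1: score each page against the question (cosine similarity
    # of normalized embeddings) and keep the top-k candidates.
    q = F.normalize(embed_question(question), dim=0)
    scores = torch.stack([F.normalize(embed_page(p), dim=0) @ q for p in pages])
    top = torch.topk(scores, k=min(top_k, len(pages))).indices.tolist()
    return [pages[i] for i in top]


def answer_with_qa_adapter(question: str, evidence_pages: list[str]) -> str:
    # Stage 2: a second, QA-tuned adapter answers using only the
    # retrieved evidence pages (placeholder output here).
    return f"answer grounded in {len(evidence_pages)} retrieved page(s)"


if __name__ == "__main__":
    document = [f"page_{i}.png" for i in range(40)]  # a long, multi-page document
    question = "What was the total revenue reported in 2023?"
    evidence = retrieve_evidence_pages(question, document, top_k=3)
    print(answer_with_qa_adapter(question, evidence))
```

Because both stages are LoRA adapters, they can in principle share the same frozen LMM backbone, with only the lightweight adapter weights swapped between the retrieval and answering steps.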

