MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Abstract

Document Question Answering (DocQA) is a very common task. Existing methods that use Large Language Models (LLMs) or Large Vision Language Models (LVLMs) together with Retrieval Augmented Generation (RAG) often prioritize information from a single modality, failing to integrate textual and visual cues effectively. These approaches struggle with complex multi-modal reasoning, which limits their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and images. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent, and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, including MMLongBench and LongDocURL, demonstrate the effectiveness of MDocAgent, which achieves an average improvement of 12.1% over current state-of-the-art methods. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at https://github.com/aiming-lab/MDocAgent.
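To make the described architecture concrete, the following is a minimal sketch of how five agents and two retrieval paths (text and image) could be wired together. It is not the released MDocAgent code. The abstract names the five agents but does not specify the order in which they run or the prompts they receive, so the ordering (general, critical, text, image, summarizing), the `Page` type, the word-overlap `retrieve` function, and the `make_stub` agents are all assumptions introduced for illustration. A real system would replace the stubs with LLM/LVLM calls and the toy retriever with learned text and image retrievers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Agent roles as named in the abstract; the execution order below is an assumption.
ROLES = ["general", "critical", "text", "image", "summarizing"]

# Hypothetical agent interface: takes a prompt string, returns a response string.
Agent = Callable[[str], str]


@dataclass
class Page:
    text: str           # extracted page text
    image_caption: str  # stand-in for the page image (a real system would pass pixels)


def retrieve(query: str, candidates: List[str], k: int = 2) -> List[int]:
    """Toy retriever (assumption): rank candidates by word overlap with the query."""
    q = set(query.lower().split())
    scores = [len(q & set(c.lower().split())) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def make_stub(name: str, log: List[str]) -> Agent:
    """Build a placeholder agent that records its call and returns a tagged string."""
    def agent(prompt: str) -> str:
        log.append(name)
        return f"<{name}>"
    return agent


def answer(question: str, pages: List[Page], agents: Dict[str, Agent], k: int = 2) -> str:
    # Multi-modal context retrieval: one top-k list per modality.
    text_ctx = [pages[i].text for i in retrieve(question, [p.text for p in pages], k)]
    image_ctx = [pages[i].image_caption
                 for i in retrieve(question, [p.image_caption for p in pages], k)]

    # Assumed flow: draft answer, critical focus, per-modality answers, final synthesis.
    draft = agents["general"](f"Q: {question}\nTEXT: {text_ctx}\nIMAGES: {image_ctx}")
    focus = agents["critical"](f"Q: {question}\nDRAFT: {draft}")
    text_ans = agents["text"](f"Q: {question}\nFOCUS: {focus}\nTEXT: {text_ctx}")
    image_ans = agents["image"](f"Q: {question}\nFOCUS: {focus}\nIMAGES: {image_ctx}")
    return agents["summarizing"](
        f"Q: {question}\nANSWERS: {[draft, text_ans, image_ans]}"
    )


# Usage with stub agents and a three-page toy document.
call_log: List[str] = []
stub_agents = {name: make_stub(name, call_log) for name in ROLES}
doc = [
    Page("revenue grew in 2023", "bar chart of revenue"),
    Page("team photo", "people smiling"),
    Page("appendix", "blank page"),
]
print(answer("What was revenue in 2023?", doc, stub_agents, k=1))
print(call_log)
```

Keeping each agent behind a plain callable means the text and image paths can be swapped or ablated independently, which is one way to check whether combining modalities actually helps on a given benchmark.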
