VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Abstract

Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. The results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.
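
The abstract names the stages of VisDoMRAG (parallel textual and visual RAG pipelines, each with evidence curation and chain-of-thought reasoning, followed by consistency-constrained fusion) but gives no prompts, retrievers, or interfaces. The following is only a minimal sketch of how such a two-pipeline flow might be wired together, not the authors' implementation. Every name in it (`LLM`, `ModalityOutput`, `run_pipeline`, `fuse`, `stub_llm`) and all prompt wording are assumptions made for illustration, and the language model is replaced by a stub so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class ModalityOutput:
    """Result of one unimodal RAG pipeline (field names are illustrative)."""
    modality: str        # "text" or "visual"
    evidence: List[str]  # retrieved chunks or page identifiers
    reasoning: str       # chain-of-thought produced for this modality
    answer: str          # this modality's candidate answer


def run_pipeline(modality: str, question: str,
                 retrieve: Callable[[str], List[str]], llm: LLM) -> ModalityOutput:
    """Retrieve evidence, then ask the model to curate it, reason, and answer."""
    evidence = retrieve(question)
    prompt = (f"Question: {question}\n"
              f"Evidence ({modality}): {evidence}\n"
              "Curate the relevant evidence, reason step by step, "
              "then give the answer on a final line starting with 'Answer:'.")
    completion = llm(prompt)
    # Split at the last 'Answer:' marker; if it is missing, the whole
    # completion is treated as the answer and the reasoning is empty.
    reasoning, _, answer = completion.rpartition("Answer:")
    return ModalityOutput(modality, evidence, reasoning.strip(), answer.strip())


def fuse(question: str, outputs: List[ModalityOutput], llm: LLM) -> str:
    """Ask the model to reconcile the per-modality reasoning chains."""
    chains = "\n\n".join(
        f"[{o.modality}] reasoning: {o.reasoning}\n[{o.modality}] answer: {o.answer}"
        for o in outputs)
    prompt = (f"Question: {question}\n{chains}\n"
              "Check the reasoning chains for consistency, resolve any "
              "disagreement, and give one final answer.")
    return llm(prompt).strip()


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch is runnable offline."""
    if "Check the reasoning chains" in prompt:
        return "42"
    return "The table lists the value.\nAnswer: 42"


if __name__ == "__main__":
    question = "What value does the table report?"
    text_out = run_pipeline("text", question, lambda q: ["text chunk 1"], stub_llm)
    visual_out = run_pipeline("visual", question, lambda q: ["page 3 image"], stub_llm)
    print(fuse(question, [text_out, visual_out], stub_llm))
```

In a real system the two `retrieve` callables would be a text retriever and a visual page retriever, and `llm` would be a text or vision-language model. The fusion step here is expressed as a single prompt over both reasoning chains, which is one plausible reading of "consistency-constrained" and should be checked against the paper itself.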
