Progressive Multimodal Reasoning via Active Retrieval

Abstract

Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs), and how to improve their performance in such scenarios remains an open problem. In this paper, we propose AR-MCTS, a universal framework that progressively improves the reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo Tree Search (MCTS). We first develop a unified retrieval module that retrieves key supporting insights for solving complex reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in automated multimodal reasoning verification, we combine the MCTS algorithm with an active retrieval mechanism, which enables the automatic generation of step-wise annotations. This strategy dynamically retrieves key insights for each reasoning step, moving beyond traditional beam search sampling to improve the diversity and reliability of the reasoning space. Additionally, we introduce a process reward model that is aligned progressively to support automatic verification of multimodal reasoning tasks. Experimental results on three complex multimodal reasoning benchmarks confirm that the AR-MCTS framework improves the performance of various multimodal models. Further analysis shows that AR-MCTS optimizes sampling diversity and accuracy, yielding reliable multimodal reasoning.
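
The idea of combining tree search with per-step retrieval can be illustrated with a toy sketch. This is not the authors' implementation: the word-overlap retriever, the `propose` and `reward` callbacks, the function name `mcts_with_retrieval`, and the tiny text-only corpus are all assumptions made for illustration, and the standard UCT rule stands in for whatever selection policy the paper actually uses. The sketch only shows the general shape: each expansion and rollout step issues a fresh retrieval query built from the partial solution, and the mean rollout reward at each node serves as a step-wise annotation.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    steps: tuple                      # reasoning steps chosen so far
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    visits: int = 0
    total: float = 0.0                # sum of rollout rewards seen through this node


def retrieve(question, steps, corpus, k=1):
    # Hypothetical retriever: rank corpus entries by word overlap with the
    # question plus the partial solution, so the query changes at every step.
    query = set((question + " " + " ".join(steps)).lower().split())
    return sorted(corpus, key=lambda d: -len(query & set(d.lower().split())))[:k]


def uct(child, parent_visits, c=1.4):
    if child.visits == 0:
        return math.inf               # try unvisited children first
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts_with_retrieval(question, corpus, propose, reward,
                        depth=3, iters=50, seed=0):
    rng = random.Random(seed)
    root = Node(steps=())
    for _ in range(iters):
        node = root
        while node.children:          # selection by UCT
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
        if len(node.steps) < depth:   # expansion, conditioned on retrieved hints
            hints = retrieve(question, node.steps, corpus)
            for step in propose(question, node.steps, hints):
                node.children.append(Node(node.steps + (step,), parent=node))
            node = rng.choice(node.children)
        steps = list(node.steps)      # random rollout, retrieving at each step
        while len(steps) < depth:
            hints = retrieve(question, steps, corpus)
            steps.append(rng.choice(propose(question, tuple(steps), hints)))
        r = reward(steps)
        while node is not None:       # backpropagation
            node.visits += 1
            node.total += r
            node = node.parent
    # Step-wise annotations: mean rollout reward along the most-visited path.
    annotations, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        annotations.append((node.steps[-1], node.total / node.visits))
    return annotations


# Toy setup (all hypothetical): a step is "good" if it uses retrieved evidence.
corpus = [
    "the area of a triangle is half base times height",
    "the capital of France is Paris",
]


def propose(question, steps, hints):
    return ["use: " + hints[0], "guess without evidence"]


def reward(steps):
    return sum(s.startswith("use: ") for s in steps) / len(steps)


annotations = mcts_with_retrieval("what is the area of a triangle",
                                  corpus, propose, reward)
for step, score in annotations:
    print(f"{score:.2f}  {step}")
```

In a setup like the one the abstract describes, the per-step scores produced this way would be the kind of signal a process reward model could be trained on; here they are simply printed.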
