Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking

Abstract

Multimodal large language models (MLLMs) exhibit impressive capabilities but still struggle with complex visual reasoning. Recent efforts attempt to enhance MLLMs' reasoning by incorporating OpenAI o1-like structured thinking, either through explicit search structures or through teacher-guided distillation, but they often fail to balance performance and efficiency. A critical limitation is their heavy reliance on extensive data and large search spaces, which makes both implicit insight extraction and data utilization inefficient. To address this, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning via Monte Carlo Tree Search (MCTS). AStar automatically derives high-level cognitive reasoning patterns from limited data using MCTS-powered hierarchical structures. Building on these explicit patterns, we design a unified reasoning framework that seamlessly integrates the model's internal reasoning capabilities with external reasoning guidelines, enabling efficient inference with minimal tree iterations. This paradigm strikes a compelling balance between performance and efficiency. Extensive experiments demonstrate AStar's effectiveness: with a 7B backbone it achieves 54.0% accuracy on the MathVerse benchmark, surpassing GPT-4o (50.2%), while maintaining substantial data and computational efficiency.
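To make the MCTS component concrete, the following is a minimal, generic sketch of Monte Carlo Tree Search with UCT selection over short sequences of reasoning actions. It is not AStar's implementation, which the abstract does not specify: the action names, the depth limit, the exploration constant, and the `toy_reward` function are all hypothetical stand-ins chosen for illustration. In a real system the reward would presumably come from checking a model's answer rather than from a fixed pattern.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical high-level reasoning actions (illustrative names only).
ACTIONS = ["describe_image", "decompose", "step_reason", "verify"]
MAX_DEPTH = 3  # assumed length of a reasoning pattern


@dataclass
class Node:
    path: tuple = ()                 # actions taken from the root to this node
    parent: "Node | None" = None
    children: dict = field(default_factory=dict)
    visits: int = 0
    value: float = 0.0               # sum of rewards backed up through this node


def uct(node, c=1.4):
    """Upper Confidence bound for Trees; unvisited nodes are tried first."""
    if node.visits == 0:
        return math.inf
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore


def toy_reward(path):
    """Stand-in for answer checking: fraction of positions matching a fixed pattern."""
    target = ("describe_image", "step_reason", "verify")
    return sum(a == b for a, b in zip(path, target)) / len(target)


def search(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded, non-terminal nodes.
        while len(node.path) < MAX_DEPTH and len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=uct)
        # 2. Expansion: add one untried action unless the node is terminal.
        if len(node.path) < MAX_DEPTH:
            untried = [a for a in ACTIONS if a not in node.children]
            action = rng.choice(untried)
            child = Node(path=node.path + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: complete the sequence with random actions.
        rollout = list(node.path)
        while len(rollout) < MAX_DEPTH:
            rollout.append(rng.choice(ACTIONS))
        reward = toy_reward(rollout)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root


def best_path(root):
    """Follow the most-visited child at each level."""
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda n: n.visits)
    return node.path


if __name__ == "__main__":
    tree = search()
    print("most-visited action sequence:", best_path(tree))
```

Under these assumptions, the most-visited path serves as the extracted "pattern": the search spends more iterations on action sequences that score well, so visit counts give a simple, inspectable ranking of candidate reasoning structures.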

