Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
February 4, 2025
Authors: Jinyang Wu, Mingkuan Feng, Shuai Zhang, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao
cs.AI
Abstract
Multimodal large language models (MLLMs) exhibit impressive capabilities but
still face challenges in complex visual reasoning. While recent efforts attempt
to enhance MLLMs' reasoning by incorporating OpenAI o1-like structured thinking
through explicit search structures or teacher-guided distillation, they often
struggle to balance performance and efficiency. A critical limitation is their
heavy reliance on extensive data and search spaces, resulting in low-efficiency
implicit insight extraction and data utilization. To address this, we propose
AStar, an Automated Structured thinking paradigm for multimodal reasoning via
Monte Carlo Tree Search (MCTS). AStar automatically derives high-level
cognitive reasoning patterns from limited data using MCTS-powered hierarchical
structures. Building on these explicit patterns, we design a unified reasoning
framework that seamlessly integrates models' internal reasoning capabilities
and external reasoning guidelines, enabling efficient inference with minimal
tree iterations. This novel paradigm strikes a compelling balance between
performance and efficiency. Extensive experiments demonstrate AStar's
effectiveness, achieving superior accuracy (54.0%) on the MathVerse
benchmark with a 7B backbone, surpassing GPT-4o (50.2%) while maintaining
substantial data and computational efficiency.
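The abstract names Monte Carlo Tree Search but gives no detail on how AStar builds or scores its tree. As a generic illustration only, and not the authors' implementation, the sketch below shows the standard UCT selection and backpropagation steps that most MCTS variants share. The `Node` class, its `action` label, and the exploration constant `c = 1.4` are assumptions introduced here for the example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One node in a search tree; `action` is a hypothetical reasoning-step label."""
    action: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(node: Node) -> Node:
    """Pick the child with the highest UCT score."""
    return max(node.children, key=lambda ch: uct_score(ch, node.visits))


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add a rollout reward to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

In a full search loop, `select_child` would be applied repeatedly from the root to a leaf, the leaf would be expanded and evaluated, and `backpropagate` would push the resulting reward back up the path. How AStar defines actions and rewards is not stated in the abstract.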