Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking

Abstract

Multimodal large language models (MLLMs) exhibit impressive capabilities but still struggle with complex visual reasoning. Recent efforts try to enhance MLLM reasoning by incorporating OpenAI o1-like structured thinking, through either explicit search structures or teacher-guided distillation, but they often fail to balance performance and efficiency. A critical limitation is their heavy reliance on extensive data and large search spaces, which leads to inefficient implicit insight extraction and poor data utilization. To address this, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning via Monte Carlo Tree Search (MCTS). AStar uses MCTS-powered hierarchical structures to automatically derive high-level cognitive reasoning patterns from limited data. Building on these explicit patterns, we design a unified reasoning framework that seamlessly integrates the model's internal reasoning capabilities with external reasoning guidelines, enabling efficient inference with minimal tree iterations. This paradigm strikes a compelling balance between performance and efficiency. Extensive experiments demonstrate AStar's effectiveness: with a 7B backbone it achieves 54.0% accuracy on the MathVerse benchmark, surpassing GPT-4o (50.2%), while maintaining substantial data and computational efficiency.
