Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
December 24, 2024
Authors: Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, Dacheng Tao
cs.AI
Abstract
In this work, we aim to develop an MLLM that understands and solves questions
by learning to create each intermediate step of the reasoning involved till the
final answer. To this end, we propose Collective Monte Carlo Tree Search
(CoMCTS), a new learning-to-reason method for MLLMs, which introduces the
concept of collective learning into "tree search" for effective and efficient
reasoning-path searching and learning. The core idea of CoMCTS is to leverage
collective knowledge from multiple models to collaboratively conjecture, search
and identify effective reasoning paths toward correct answers via four
iterative operations including Expansion, Simulation and Error Positioning,
Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a
multimodal dataset with a tree of rich, explicit and well-defined reasoning
nodes for each question. With Mulberry-260k, we perform collective SFT to train
our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and
Reflection capabilities. Extensive experiments demonstrate the superiority of
our proposed methods on various benchmarks. Code will be available at
https://github.com/HJYao00/Mulberry
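The four iterative operations named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so the integer "reasoning states", the stub models, the `score` function, the pruning `threshold`, and the UCB-style selection rule below are all assumptions chosen only to show how several models might jointly grow and prune one search tree.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    state: int                          # toy stand-in for a reasoning step
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

    def path(self) -> List[int]:
        """Return the states from the root down to this node."""
        node, states = self, []
        while node is not None:
            states.append(node.state)
            node = node.parent
        return states[::-1]


def ucb(node: Node, c: float) -> float:
    # Assumed selection rule: mean value plus a UCB-style exploration bonus.
    parent_visits = node.parent.visits if node.parent else node.visits
    bonus = math.sqrt(math.log(max(parent_visits, 1)) / max(node.visits, 1))
    return node.value + c * bonus


def comcts(models: List[Callable[[int], int]],
           score: Callable[[int], float],
           is_answer: Callable[[int], bool],
           root_state: int,
           iterations: int = 50,
           threshold: float = 0.0,
           c: float = 1.0) -> Optional[List[int]]:
    root = Node(root_state)
    current, frontier = root, []
    for _ in range(iterations):
        # 1. Expansion: every model proposes a next step from the current node.
        candidates = [Node(m(current.state), parent=current) for m in models]

        # 2. Simulation and error positioning: score each candidate and
        #    drop the ones at or below the threshold (treated as erroneous).
        kept = []
        for child in candidates:
            child.value, child.visits = score(child.state), 1
            if child.value > threshold:
                kept.append(child)
        current.children.extend(kept)

        # 3. Backpropagation: update visit counts and running-mean values
        #    on every ancestor of each surviving candidate.
        for child in kept:
            node = child.parent
            while node is not None:
                node.visits += 1
                node.value += (child.value - node.value) / node.visits
                node = node.parent

        for child in kept:
            if is_answer(child.state):
                return child.path()

        # 4. Selection: continue from the unexpanded node with the best UCB.
        frontier.extend(kept)
        if not frontier:
            return None
        current = max(frontier, key=lambda n: ucb(n, c))
        frontier.remove(current)
    return None


# Hypothetical stand-ins for the models: each maps a state to a next state.
toy_models = [lambda s: s + 1, lambda s: s + 3, lambda s: s * 2]
path = comcts(toy_models,
              score=lambda s: 0.0 if s > 10 else s / 10,
              is_answer=lambda s: s == 10,
              root_state=1)
print(path)
```

In this toy setting a returned path plays the role of one reasoning path through the tree; in the paper's setting the nodes would be textual reasoning steps proposed by MLLMs, and the scoring would come from the models themselves rather than from a hand-written function.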