Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Abstract

In this work, we aim to develop an MLLM that understands and solves questions by learning to create each intermediate step of the reasoning involved, up to the final answer. To this end, we propose Collective Monte Carlo Tree Search (CoMCTS), a new learning-to-reason method for MLLMs that introduces the concept of collective learning into "tree search" for effective and efficient reasoning-path searching and learning. The core idea of CoMCTS is to leverage collective knowledge from multiple models to collaboratively conjecture, search, and identify effective reasoning paths toward correct answers via four iterative operations: Expansion, Simulation and Error Positioning, Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a multimodal dataset that provides a tree of rich, explicit, and well-defined reasoning nodes for each question. With Mulberry-260k, we perform collective SFT to train our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and Reflection capabilities. Extensive experiments demonstrate the superiority of our proposed methods on various benchmarks. Code will be available at https://github.com/HJYao00/Mulberry
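
The four operations named in the abstract can be pictured with a toy loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the `Node` class, the `collective_search` function, the scoring threshold, the running-mean value update, and the UCB-style selection rule are all hypothetical choices made for illustration. The "models" are two hard-coded toy policies for the arithmetic question (2+3)*4.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One reasoning step in the search tree (hypothetical structure)."""
    step: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

    def path(self) -> List[str]:
        steps, node = [], self
        while node.parent is not None:
            steps.append(node.step)
            node = node.parent
        return steps[::-1]


def non_root_nodes(root: Node) -> List[Node]:
    stack, out = list(root.children), []
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(node.children)
    return out


def ucb(node: Node, c: float) -> float:
    # Assumed UCB-style score; the paper's actual selection rule is not in the abstract.
    return node.value + c * math.sqrt(math.log(node.parent.visits + 1) / (node.visits + 1))


def collective_search(policies, score, is_answer, iterations=5, threshold=0.0, c=1.0):
    root = Node(step="<question>")
    current = root
    for _ in range(iterations):
        kept = []
        for policy in policies:
            # 1. Expansion: each model proposes a continuation from the current node.
            proposal = policy(current.path())
            # 2. Simulation and error positioning: score step by step and
            #    cut the proposal at its first low-scoring (erroneous) step.
            parent = current
            for step in proposal:
                s = score(parent.path() + [step])
                if s < threshold:
                    break
                child = Node(step=step, parent=parent, value=s)
                parent.children.append(child)
                kept.append(child)
                parent = child
        # 3. Backpropagation: fold each new node's score into its ancestors.
        for node in kept:
            node.visits = 1
            anc = node.parent
            while anc is not None:
                anc.value = (anc.value * anc.visits + node.value) / (anc.visits + 1)
                anc.visits += 1
                anc = anc.parent
        for node in kept:
            if is_answer(node.path()):
                return node.path()
        # 4. Selection: restart the next iteration from the most promising node.
        candidates = non_root_nodes(root)
        current = max(candidates, key=lambda n: ucb(n, c)) if candidates else root
    return None


# Toy setup (all hypothetical): two imperfect "models" for the question (2+3)*4.
VALID_STEPS = {"2+3=5", "5*4=20"}

def policy_a(prefix):  # right first step, wrong second step
    return ["2+3=5", "5*4=25"][len(prefix):]

def policy_b(prefix):  # wrong from scratch, right once given a correct prefix
    return ["5*4=20"] if prefix == ["2+3=5"] else ["2+3=6", "6*4=24"]

def score(path):
    return 1.0 if path[-1] in VALID_STEPS else -1.0

def is_answer(path):
    return path[-1].endswith("=20")

result = collective_search([policy_a, policy_b], score, is_answer)
```

In this toy setting neither policy reaches the answer alone, but the joint search does: the first model contributes a valid first step, the erroneous steps are cut off, and the second model completes the path from the retained prefix.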
