

VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

April 12, 2025
Authors: Yikun Wang, Siyin Wang, Qinyuan Cheng, Zhaoye Fei, Liang Ding, Qipeng Guo, Dacheng Tao, Xipeng Qiu
cs.AI

Abstract

Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
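As a rough illustration of the kind of look-ahead tree search the abstract describes, the sketch below interleaves candidate textual thoughts with visual actions and keeps the branch whose short rollout scores best. All names here (ReasoningState, propose_steps, rollout_score, the greedy expansion loop) are hypothetical placeholders for illustration, not the paper's actual API or algorithm.

```python
# Minimal sketch of look-ahead tree search over interleaved visual-textual
# reasoning steps. Every function and class below is a placeholder stub,
# not taken from the VisuoThink paper.
from dataclasses import dataclass, field


@dataclass
class ReasoningState:
    """One search-tree node: the textual trace plus any visual aids
    (e.g. auxiliary-line sketches) produced so far."""
    text_steps: list = field(default_factory=list)
    visual_steps: list = field(default_factory=list)


def propose_steps(state: ReasoningState, k: int) -> list:
    """Ask the model for k candidate next steps, each pairing a textual
    thought with an optional visual action (placeholder)."""
    return [(f"thought {i}", f"sketch {i}") for i in range(k)]


def rollout_score(state: ReasoningState, depth: int) -> float:
    """Cheaply simulate a few steps ahead and score how promising the
    partial solution looks (placeholder heuristic)."""
    return 0.0


def lookahead_search(root: ReasoningState, max_depth: int = 4,
                     branch: int = 3) -> ReasoningState:
    """Greedy look-ahead: expand candidate steps, score each child with a
    short rollout, and keep the best branch at every depth."""
    state = root
    for _ in range(max_depth):
        candidates = []
        for text, visual in propose_steps(state, branch):
            child = ReasoningState(state.text_steps + [text],
                                   state.visual_steps + [visual])
            candidates.append((rollout_score(child, depth=2), child))
        state = max(candidates, key=lambda c: c[0])[1]
    return state
```

In the actual framework, the candidate proposals, rollouts, and scoring would presumably be driven by LVLM calls and tool-based visual manipulation rather than the stubs above; the sketch only conveys the search structure implied by "test-time scaling through look-ahead tree search."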
