VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

April 12, 2025
Authors: Yikun Wang, Siyin Wang, Qinyuan Cheng, Zhaoye Fei, Liang Ding, Qipeng Guo, Dacheng Tao, Xipeng Qiu
cs.AI

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities. However, these models often falter on complex reasoning tasks that humans typically solve with visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning. To overcome these limitations, and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates the visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning, and it incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance on tasks involving geometry and spatial reasoning.
