Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
February 23, 2025
Authors: Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo
cs.AI
Abstract
Solving complex long-horizon robotic manipulation problems requires
sophisticated high-level planning capabilities, the ability to reason about the
physical world, and reactively choose appropriate motor skills. Vision-language
models (VLMs) pretrained on Internet data could in principle offer a framework
for tackling such problems. However, in their current form, VLMs lack both the
nuanced understanding of intricate physics required for robotic manipulation
and the ability to reason over long horizons to address error compounding
issues. In this paper, we introduce a novel test-time computation framework
that enhances VLMs' physical reasoning capabilities for multi-stage
manipulation tasks. At its core, our approach iteratively improves a pretrained
VLM with a "reflection" mechanism - it uses a generative model to imagine
future world states, leverages these predictions to guide action selection, and
critically reflects on potential suboptimalities to refine its reasoning.
Experimental results demonstrate that our method significantly outperforms
several state-of-the-art commercial VLMs as well as other post-training
approaches such as Monte Carlo Tree Search (MCTS). Videos are available at
https://reflect-vlm.github.io.
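To make the "reflection" mechanism described above more concrete, the sketch below outlines a plausible test-time loop: propose an action with the VLM, imagine the resulting future state with a generative world model, critique that imagined outcome, and revise the action if the critique flags a problem. This is a minimal illustrative sketch only; all names (`vlm.propose_action`, `world_model.predict`, `vlm.reflect`, `critique.is_good`) are hypothetical placeholders and do not correspond to the paper's released code or API.

```python
# Hypothetical sketch of a test-time reflective planning loop.
# Assumes a VLM policy wrapper, an action-conditioned generative
# world model, and a reflection/critique interface -- all placeholders.

def reflective_plan(vlm, world_model, obs, goal, max_revisions=5):
    """Propose an action, imagine its outcome, and revise it if the
    imagined outcome looks suboptimal."""
    # 1. Propose an initial action from the current observation and goal.
    action = vlm.propose_action(obs, goal)

    for _ in range(max_revisions):
        # 2. Imagine the future world state that would result from the action.
        imagined_obs = world_model.predict(obs, action)

        # 3. Reflect: ask the VLM whether the imagined state still makes
        #    progress toward the goal, or whether the action is suboptimal.
        critique = vlm.reflect(obs, imagined_obs, action, goal)
        if critique.is_good:
            break

        # 4. Revise the action, conditioning on the critique as feedback.
        action = vlm.propose_action(obs, goal, feedback=critique.text)

    return action
```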