Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
February 23, 2025
Authors: Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo
cs.AI
Abstract
Solving complex long-horizon robotic manipulation problems requires
sophisticated high-level planning capabilities, the ability to reason about the
physical world, and reactively choose appropriate motor skills. Vision-language
models (VLMs) pretrained on Internet data could in principle offer a framework
for tackling such problems. However, in their current form, VLMs lack both the
nuanced understanding of intricate physics required for robotic manipulation
and the ability to reason over long horizons to address error compounding
issues. In this paper, we introduce a novel test-time computation framework
that enhances VLMs' physical reasoning capabilities for multi-stage
manipulation tasks. At its core, our approach iteratively improves a pretrained
VLM with a "reflection" mechanism - it uses a generative model to imagine
future world states, leverages these predictions to guide action selection, and
critically reflects on potential suboptimalities to refine its reasoning.
Experimental results demonstrate that our method significantly outperforms
several state-of-the-art commercial VLMs as well as other post-training
approaches such as Monte Carlo Tree Search (MCTS). Videos are available at
https://reflect-vlm.github.io.
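To make the "reflection" mechanism described above more concrete, the sketch below outlines a plausible test-time loop: propose an action with the VLM, imagine the resulting future state with a generative world model, critique that imagined outcome, and revise the action if the critique flags a problem. This is a minimal illustrative sketch only; all names (`vlm.propose_action`, `world_model.predict`, `vlm.reflect`, `critique.is_good`) are hypothetical placeholders and do not correspond to the paper's released code or API.

```python
# Hypothetical sketch of a test-time reflective planning loop.
# Assumes a VLM policy wrapper, an action-conditioned generative
# world model, and a reflection/critique interface -- all placeholders.

def reflective_plan(vlm, world_model, obs, goal, max_revisions=5):
    """Propose an action, imagine its outcome, and revise it if the
    imagined outcome looks suboptimal."""
    # 1. Propose an initial action from the current observation and goal.
    action = vlm.propose_action(obs, goal)

    for _ in range(max_revisions):
        # 2. Imagine the future world state that would result from the action.
        imagined_obs = world_model.predict(obs, action)

        # 3. Reflect: ask the VLM whether the imagined state still makes
        #    progress toward the goal, or whether the action is suboptimal.
        critique = vlm.reflect(obs, imagined_obs, action, goal)
        if critique.is_good:
            break

        # 4. Revise the action, conditioning on the critique as feedback.
        action = vlm.propose_action(obs, goal, feedback=critique.text)

    return action
```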