ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
October 23, 2024
Authors: Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang
cs.AI
Abstract
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting
them to embodied decision-making in open-world environments presents
challenges. A key issue is the difficulty in smoothly connecting individual
entities in low-level observations with abstract concepts required for
planning. A common approach to address this problem is through the use of
hierarchical agents, where VLMs serve as high-level reasoners that break down
tasks into executable sub-tasks, typically specified using language and
imagined observations. However, language often fails to effectively convey
spatial information, while generating future images with sufficient accuracy
remains challenging. To address these limitations, we propose visual-temporal
context prompting, a novel communication protocol between VLMs and policy
models. This protocol leverages object segmentation from both past and present
observations to guide policy-environment interactions. Using this approach, we
train ROCKET-1, a low-level policy that predicts actions based on concatenated
visual observations and segmentation masks, with real-time object tracking
provided by SAM-2. Our method unlocks the full potential of VLMs'
visual-language reasoning abilities, enabling them to solve complex creative
tasks, especially those heavily reliant on spatial understanding. Experiments
in Minecraft demonstrate that our approach allows agents to accomplish
previously unattainable tasks, highlighting the effectiveness of
visual-temporal context prompting in embodied decision-making. Code and demos
will be available on the project page: https://craftjarvis.github.io/ROCKET-1.
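To make the protocol concrete, here is a minimal, hypothetical PyTorch sketch of how a ROCKET-1-style low-level policy might consume a visual-temporal context prompt: the current RGB observation channel-concatenated with a binary segmentation mask of the object selected by the high-level reasoner, plus a recurrent state carrying temporal context. All names (`Rocket1StylePolicy`, the tracker placeholder, layer sizes, action count) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of visual-temporal context prompting: the policy
# receives the RGB observation concatenated channel-wise with a binary
# segmentation mask, plus a recurrent state for temporal context.
# Names and hyperparameters are illustrative, not from the paper's code.

import torch
import torch.nn as nn


class Rocket1StylePolicy(nn.Module):
    """Predicts an action distribution from (observation, mask) pairs."""

    def __init__(self, num_actions: int = 16, hidden: int = 256):
        super().__init__()
        # 3 RGB channels + 1 mask channel = 4 input channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # A GRU cell carries temporal context across frames.
        self.rnn = nn.GRUCell(64, hidden)
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, obs, mask, state):
        # obs: (B, 3, H, W) in [0, 1]; mask: (B, 1, H, W) binary.
        x = torch.cat([obs, mask], dim=1)  # the visual prompt: obs + mask
        feat = self.encoder(x)
        state = self.rnn(feat, state)
        logits = self.action_head(state)
        return logits, state


# One step of a hypothetical interaction loop. The random mask below stands
# in for the output of a SAM-2-based tracker that propagates the
# VLM-selected segmentation across frames.
policy = Rocket1StylePolicy()
state = torch.zeros(1, 256)
obs = torch.rand(1, 3, 128, 128)                    # current frame
mask = (torch.rand(1, 1, 128, 128) > 0.5).float()   # placeholder tracker output
logits, state = policy(obs, mask, state)
action = torch.distributions.Categorical(logits=logits).sample()
```

In the full system described in the abstract, the mask comes from SAM-2 tracking the segmented object in real time, and prompts can also reference past observations; this sketch covers only the per-frame recurrent case.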