ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
October 23, 2024
Authors: Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang
cs.AI
Abstract
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting
them to embodied decision-making in open-world environments presents
challenges. A key issue is the difficulty in smoothly connecting individual
entities in low-level observations with abstract concepts required for
planning. A common approach to address this problem is through the use of
hierarchical agents, where VLMs serve as high-level reasoners that break down
tasks into executable sub-tasks, typically specified using language and
imagined observations. However, language often fails to effectively convey
spatial information, while generating future images with sufficient accuracy
remains challenging. To address these limitations, we propose visual-temporal
context prompting, a novel communication protocol between VLMs and policy
models. This protocol leverages object segmentation from both past and present
observations to guide policy-environment interactions. Using this approach, we
train ROCKET-1, a low-level policy that predicts actions based on concatenated
visual observations and segmentation masks, with real-time object tracking
provided by SAM-2. Our method unlocks the full potential of VLMs'
vision-language reasoning abilities, enabling them to solve complex creative
tasks, especially those heavily reliant on spatial understanding. Experiments
in Minecraft demonstrate that our approach allows agents to accomplish
previously unattainable tasks, highlighting the effectiveness of
visual-temporal context prompting in embodied decision-making. Code and demos
will be available on the project page: https://craftjarvis.github.io/ROCKET-1.
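The protocol described in the abstract implies a simple policy interface: each observation frame is paired with a segmentation mask marking the object the high-level VLM wants the agent to engage with, and the low-level policy conditions on the concatenated frame-mask stack over a window of past and present timesteps. Below is a minimal PyTorch sketch of that interface; the layer sizes, GRU-based temporal aggregation, discrete action head, and tensor shapes are illustrative assumptions rather than ROCKET-1's published architecture, and the dummy zero masks stand in for real SAM-2 tracker output.

```python
import torch
import torch.nn as nn

class VisualTemporalPolicy(nn.Module):
    """Hypothetical sketch of a ROCKET-1-style low-level policy.

    Each RGB frame is concatenated channel-wise with a binary object
    segmentation mask (the visual-temporal context prompt), encoded
    per frame, aggregated over time, and mapped to action logits.
    All hyperparameters here are illustrative assumptions.
    """

    def __init__(self, num_actions: int = 32, hidden: int = 256):
        super().__init__()
        # 3 RGB channels + 1 mask channel per frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Temporal aggregation over the past-and-present observation window.
        self.temporal = nn.GRU(64, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) visual observations
        # masks:  (B, T, 1, H, W) per-frame segmentation of the target object
        #         (zeros when no object is being tracked at that step)
        b, t = frames.shape[:2]
        x = torch.cat([frames, masks], dim=2)   # (B, T, 4, H, W)
        feats = self.encoder(x.flatten(0, 1))   # (B*T, 64)
        feats = feats.view(b, t, -1)            # (B, T, 64)
        out, _ = self.temporal(feats)
        return self.action_head(out[:, -1])     # action logits for current step

# Example rollout step with dummy tensors; in the real system,
# SAM-2 would supply `masks` by tracking the prompted object.
policy = VisualTemporalPolicy()
frames = torch.rand(1, 8, 3, 128, 128)
masks = torch.zeros(1, 8, 1, 128, 128)
logits = policy(frames, masks)
print(logits.shape)  # torch.Size([1, 32])
```

Feeding the mask as an extra input channel grounds "which object" directly in pixel space, which is precisely the spatial information the abstract argues language prompts fail to convey.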