

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting

October 23, 2024
作者: Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang
cs.AI

Abstract

Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. A key issue is the difficulty in smoothly connecting individual entities in low-level observations with abstract concepts required for planning. A common approach to address this problem is through the use of hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language and imagined observations. However, language often fails to effectively convey spatial information, while generating future images with sufficient accuracy remains challenging. To address these limitations, we propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from both past and present observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, with real-time object tracking provided by SAM-2. Our method unlocks the full potential of VLMs' visual-language reasoning abilities, enabling them to solve complex creative tasks, especially those heavily reliant on spatial understanding. Experiments in Minecraft demonstrate that our approach allows agents to accomplish previously unattainable tasks, highlighting the effectiveness of visual-temporal context prompting in embodied decision-making. Code and demos will be available on the project page: https://craftjarvis.github.io/ROCKET-1.
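
The abstract describes the core mechanism concretely enough to sketch: the low-level policy consumes each visual observation concatenated channel-wise with a segmentation mask of the prompted object (supplied in real time by a tracker such as SAM-2), aggregates this visual-temporal context over time, and predicts an action. The paper's code is not reproduced here, so the following is a minimal, hypothetical PyTorch sketch of such a policy; all names (VisualTemporalPolicy, hidden_dim, the layer sizes) are illustrative assumptions, not the actual ROCKET-1 architecture.

```python
import torch
import torch.nn as nn

class VisualTemporalPolicy(nn.Module):
    """Hypothetical sketch of a ROCKET-1-style low-level policy.

    Each RGB frame is concatenated channel-wise with a binary mask
    highlighting the prompted object (e.g., produced by an object
    tracker such as SAM-2). A recurrent core aggregates the resulting
    visual-temporal context before predicting per-step action logits.
    """

    def __init__(self, num_actions: int, hidden_dim: int = 256):
        super().__init__()
        # 3 RGB channels + 1 mask channel = 4 input channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden_dim), nn.ReLU(),
        )
        self.core = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # obs:  (B, T, 3, H, W) RGB observation window
        # mask: (B, T, 1, H, W) binary segmentation masks
        b, t = obs.shape[:2]
        x = torch.cat([obs, mask], dim=2)           # (B, T, 4, H, W)
        feats = self.encoder(x.flatten(0, 1))       # (B*T, hidden_dim)
        feats, _ = self.core(feats.view(b, t, -1))  # temporal aggregation
        return self.action_head(feats)              # (B, T, num_actions)

if __name__ == "__main__":
    policy = VisualTemporalPolicy(num_actions=12)
    obs = torch.rand(1, 8, 3, 128, 128)    # 8-frame observation window
    mask = torch.zeros(1, 8, 1, 128, 128)  # tracker-provided object masks
    logits = policy(obs, mask)             # (1, 8, 12) action logits
```

In deployment, the VLM would act as the high-level reasoner: it selects which object in a past or present frame to highlight, the tracker propagates that object's mask across subsequent frames, and the policy above conditions on the mask channel rather than on a language instruction or an imagined future image.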
