ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting

Abstract

Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments remains challenging. A key issue is the difficulty of smoothly connecting individual entities in low-level observations with the abstract concepts required for planning. A common approach to this problem is to use hierarchical agents, where VLMs serve as high-level reasoners that break tasks down into executable sub-tasks, typically specified using language and imagined observations. However, language often fails to convey spatial information effectively, and generating future images with sufficient accuracy remains challenging. To address these limitations, we propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from both past and present observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, with real-time object tracking provided by SAM-2. Our method unlocks the full potential of VLMs' vision-language reasoning abilities, enabling them to solve complex creative tasks, especially those that rely heavily on spatial understanding. Experiments in Minecraft demonstrate that our approach allows agents to accomplish previously unattainable tasks, highlighting the effectiveness of visual-temporal context prompting in embodied decision-making. Code and demos will be available on the project page: https://craftjarvis.github.io/ROCKET-1.
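As a rough illustration of what "concatenated visual observations and segmentation masks" could look like as a policy input, the sketch below stacks each RGB frame with a binary object mask along the channel axis. The function name `build_policy_input`, the array shapes, the [0, 1] normalization, and the channel-last layout are assumptions made for this example only; they are not taken from the ROCKET-1 implementation, and the masks here are hand-made rather than produced by SAM-2.

```python
import numpy as np


def build_policy_input(frames, masks):
    """Stack RGB frames with per-frame object masks along the channel axis.

    frames: (T, H, W, 3) uint8 array of past and present observations.
    masks:  (T, H, W) boolean or 0/1 array, one segmentation mask per frame.
    Returns a (T, H, W, 4) float32 array (hypothetical layout).
    """
    frames = np.asarray(frames, dtype=np.float32) / 255.0  # scale RGB to [0, 1]
    masks = np.asarray(masks, dtype=np.float32)[..., None]  # add a channel axis
    if frames.shape[:3] != masks.shape[:3]:
        raise ValueError("frames and masks must share T, H, W")
    return np.concatenate([frames, masks], axis=-1)


# Toy example: 4 frames of 8x8 pixels, with a 3x3 object highlighted in each.
T, H, W = 4, 8, 8
frames = np.zeros((T, H, W, 3), dtype=np.uint8)
masks = np.zeros((T, H, W), dtype=bool)
masks[:, 2:5, 2:5] = True

x = build_policy_input(frames, masks)
print(x.shape)
```

In this layout the fourth channel marks which pixels belong to the object of interest in each frame, which is one simple way a mask can carry spatial information that a text instruction would not.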
