DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Abstract

The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, have proven challenging to learn and are typically developed for task-specific solutions with online policy learning. We argue that the true potential of world models lies in their ability to reason and plan across diverse problems using only passive data. Concretely, we require world models to have the following three properties: 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To realize this, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This design allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic behavior planning by treating desired goal patch features as prediction targets. We evaluate DINO-WM across various domains, including maze navigation, tabletop pushing, and particle manipulation. Our experiments demonstrate that DINO-WM can generate zero-shot behavioral solutions at test time without relying on expert demonstrations, reward modeling, or pre-learned inverse models. Notably, DINO-WM exhibits strong generalization capabilities compared to prior state-of-the-art work, adapting to diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
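The idea of reaching an observational goal through action sequence optimization can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which optimizer is used, so the cross-entropy method (CEM) below is an assumption, and `toy_dynamics`, `rollout`, and `plan_cem` are hypothetical names. The toy dynamics function stands in for a learned model that predicts future patch features, and the planning cost is the mean squared distance between the predicted final features and the goal features.

```python
import numpy as np


def toy_dynamics(z, a):
    """Hypothetical stand-in for a learned latent dynamics model.

    z: (num_patches, dim) array of patch features.
    a: (dim,) action. In this toy, the action shifts every patch feature.
    """
    return z + a  # broadcasts the action over all patches


def rollout(z0, actions, dynamics):
    """Apply a sequence of actions and return the predicted final features."""
    z = z0
    for a in actions:
        z = dynamics(z, a)
    return z


def plan_cem(z0, z_goal, dynamics, horizon=5, samples=256,
             elites=32, iters=10, seed=0):
    """Search for an action sequence whose predicted final features match z_goal.

    Uses the cross-entropy method (an assumed optimizer choice): sample
    candidate sequences, keep the lowest-cost ones, refit the sampling
    distribution to them, and repeat.
    """
    rng = np.random.default_rng(seed)
    act_dim = z0.shape[-1]
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        noise = rng.standard_normal((samples, horizon, act_dim))
        candidates = mean + std * noise
        # Cost: mean squared error between predicted and goal patch features.
        costs = np.array([
            np.mean((rollout(z0, c, dynamics) - z_goal) ** 2)
            for c in candidates
        ])
        elite = candidates[np.argsort(costs)[:elites]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6  # keep a small amount of exploration
    return mean


if __name__ == "__main__":
    z0 = np.zeros((4, 3))           # 4 patches, 3-dimensional features
    z_goal = np.full((4, 3), 2.0)   # goal features, as if encoded from a goal image
    plan = plan_cem(z0, z_goal, toy_dynamics)
    final = rollout(z0, plan, toy_dynamics)
    print("final cost:", np.mean((final - z_goal) ** 2))
```

Because the goal enters only through the feature-space cost, the same planner can be pointed at a different goal without retraining anything, which is the sense in which such planning is task-agnostic.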
