Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
April 3, 2025
Authors: Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta
cs.AI
Abstract
Imitation learning has emerged as a promising approach towards building
generalist robots. However, scaling imitation learning for large robot
foundation models remains challenging due to its reliance on high-quality
expert demonstrations. Meanwhile, large amounts of video data depicting a wide
range of environments and diverse behaviors are readily available. This data
provides a rich source of information about real-world dynamics and
agent-environment interactions. Leveraging this data directly for imitation
learning, however, has proven difficult due to the lack of action annotation
required for most contemporary methods. In this work, we present Unified World
Models (UWM), a framework that allows for leveraging both video and action data
for policy learning. Specifically, a UWM integrates an action diffusion process
and a video diffusion process within a unified transformer architecture, where
independent diffusion timesteps govern each modality. We show that by simply
controlling each diffusion timestep, UWM can flexibly represent a policy, a
forward dynamics model, an inverse dynamics model, and a video generator. Through simulated
and real-world experiments, we show that: (1) UWM enables effective pretraining
on large-scale multitask robot datasets with both dynamics and action
predictions, resulting in more generalizable and robust policies than imitation
learning, and (2) UWM naturally facilitates learning from action-free video data
through independent control of modality-specific diffusion timesteps, further
improving the performance of finetuned policies. Our results suggest that UWM
offers a promising step toward harnessing large, heterogeneous datasets for
scalable robot learning, and provides a simple unification between the often
disparate paradigms of imitation learning and world modeling. Videos and code
are available at https://weirdlabuw.github.io/uwm/.
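
To make the timestep-controlled behavior concrete, below is a minimal sketch (PyTorch-style, not the authors' released code) of a joint denoiser in which video and action tokens each receive their own diffusion timestep embedding. All names, dimensions, and step counts (UnifiedDenoiser, NUM_STEPS, token shapes) are illustrative assumptions. The idea follows the abstract: setting one modality's timestep to maximum noise effectively marginalizes it out, while setting it to zero conditions on clean data, so a single network can behave as a policy, a forward dynamics model, an inverse dynamics model, or a video generator.

```python
# Hedged sketch of per-modality diffusion timesteps in a unified transformer.
# Not the paper's implementation; all identifiers are hypothetical.
import torch
import torch.nn as nn

NUM_STEPS = 1000  # assumed number of diffusion steps

class UnifiedDenoiser(nn.Module):
    def __init__(self, video_dim=64, action_dim=7, hidden=128):
        super().__init__()
        self.video_in = nn.Linear(video_dim, hidden)
        self.action_in = nn.Linear(action_dim, hidden)
        # Separate timestep embeddings: each modality is governed by its own noise level.
        self.t_video_emb = nn.Embedding(NUM_STEPS + 1, hidden)
        self.t_action_emb = nn.Embedding(NUM_STEPS + 1, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.video_out = nn.Linear(hidden, video_dim)
        self.action_out = nn.Linear(hidden, action_dim)

    def forward(self, noisy_video, noisy_action, t_video, t_action):
        # noisy_video: [B, Tv, video_dim], noisy_action: [B, Ta, action_dim]
        v = self.video_in(noisy_video) + self.t_video_emb(t_video)[:, None]
        a = self.action_in(noisy_action) + self.t_action_emb(t_action)[:, None]
        h = self.transformer(torch.cat([v, a], dim=1))
        nv = noisy_video.shape[1]
        return self.video_out(h[:, :nv]), self.action_out(h[:, nv:])

model = UnifiedDenoiser()
video = torch.randn(2, 8, 64)   # future video tokens (noised)
action = torch.randn(2, 16, 7)  # action chunk (noised)
t_max = torch.full((2,), NUM_STEPS, dtype=torch.long)  # pure noise -> marginalize
t_mid = torch.full((2,), 500, dtype=torch.long)        # current denoising step
t_zero = torch.zeros(2, dtype=torch.long)              # clean -> condition on

# Policy: marginalize future video, denoise actions.
_, act_pred = model(video, action, t_max, t_mid)
# Forward dynamics: condition on clean actions, denoise video.
vid_pred, _ = model(video, action, t_mid, t_zero)
# Inverse dynamics: condition on clean video, denoise actions.
_, act_pred = model(video, action, t_zero, t_mid)
# Video generation: marginalize actions, denoise video.
vid_pred, _ = model(video, action, t_mid, t_max)
```

In this reading, one denoising network trained on jointly noised video and action tokens exposes all four conditionals simply by choosing which timestep is clamped to zero (conditioning) and which is clamped to maximum noise (marginalization) at inference time.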