Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Abstract

Imitation learning has emerged as a promising approach to building generalist robots. However, scaling imitation learning to large robot foundation models remains challenging because it relies on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging it directly for imitation learning, however, has proven difficult because it lacks the action annotations that most contemporary methods require. In this work, we present Unified World Models (UWM), a framework that leverages both video and action data for policy learning. Specifically, UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where each modality is governed by its own independent diffusion timestep. We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning; and (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification of the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
