Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

April 3, 2025
Authors: Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta
cs.AI

Abstract

Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotations required by most contemporary methods. In this work, we present Unified World Models (UWM), a framework that leverages both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action prediction, resulting in policies that are more generalizable and robust than those learned by imitation; and (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification of the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
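
The mode-switching mechanism the abstract describes, one diffusion timestep per modality, can be made concrete with a small sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the names (UWMToy, t_video, t_action, T_MAX) are invented for exposition, a two-layer MLP stands in for the unified transformer, and flat vectors stand in for video frames and action chunks.

```python
# Minimal sketch of per-modality diffusion timesteps (NOT the authors' code).
import torch
import torch.nn as nn

T_MAX = 1000  # assumed number of diffusion steps (hypothetical)


class UWMToy(nn.Module):
    """Toy joint denoiser: predicts the noise added to a future observation
    ("video") and to an action, conditioned on the current observation and
    one diffusion timestep per modality."""

    def __init__(self, obs_dim=32, act_dim=8, hidden=128):
        super().__init__()
        in_dim = obs_dim + obs_dim + act_dim + 2  # obs + noisy next obs + noisy action + 2 timesteps
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.video_head = nn.Linear(hidden, obs_dim)   # noise prediction for the video modality
        self.action_head = nn.Linear(hidden, act_dim)  # noise prediction for the action modality

    def forward(self, obs, noisy_next_obs, noisy_action, t_video, t_action):
        # Each modality carries its own timestep, so their noise levels can differ.
        t = torch.stack([t_video, t_action], dim=-1).float() / T_MAX
        h = self.backbone(torch.cat([obs, noisy_next_obs, noisy_action, t], dim=-1))
        return self.video_head(h), self.action_head(h)


model = UWMToy()
obs = torch.randn(4, 32)

# Choosing the two timesteps at inference selects the mode:
#   policy:           t_video = T_MAX (video marginalized),   denoise the action
#   forward dynamics: t_action = 0    (clean action given),   denoise the video
#   inverse dynamics: t_video = 0     (clean future video),   denoise the action
#   video generator:  t_action = T_MAX (action marginalized), denoise the video
t_video = torch.full((4,), T_MAX)         # marginalize video -> policy mode
t_action = torch.randint(0, T_MAX, (4,))  # current action noise level
eps_video, eps_action = model(
    obs,
    torch.randn(4, 32),  # pure-noise "video" input
    torch.randn(4, 8),   # noisy action being denoised
    t_video,
    t_action,
)
```

Fully noising a modality (timestep T_MAX) makes its input uninformative, effectively marginalizing it out. This is also what allows action-free video clips to contribute a training signal: the action branch can be held at full noise while the video branch is denoised, matching the abstract's claim that independent modality-specific timesteps enable learning from action-free data.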
