Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression
February 6, 2025
Authors: Lirui Wang, Kevin Zhao, Chaoqi Liu, Xinlei Chen
cs.AI
Abstract
We propose Heterogeneous Masked Autoregression (HMA) for modeling
action-video dynamics to generate high-quality data and evaluation for scaling
robot learning. Building interactive video world models and policies for
robotics is difficult due to the challenge of handling diverse settings while
maintaining computational efficiency to run in real time. HMA uses
heterogeneous pre-training from observations and action sequences across
different robotic embodiments, domains, and tasks. It then uses masked
autoregression to generate quantized or soft tokens for video predictions.
HMA achieves better visual fidelity and controllability than previous
robotic video generation models, while running 15 times faster in the real world.
After post-training, this model can be used as a video simulator from low-level
action inputs for evaluating policies and generating synthetic data. See this
link https://liruiw.github.io/hma for more information.
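The masked-autoregressive decoding the abstract refers to can be illustrated with a toy sketch: a frame is represented as a grid of discrete tokens, all positions start masked, and the model iteratively commits its most confident predictions until the frame is complete. Everything below (the stand-in model, the unmasking schedule, the vocabulary size) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

MASK = -1  # sentinel id for masked token positions

def toy_dynamics_model(tokens, action, rng):
    # Hypothetical stand-in for HMA's transformer: returns a predicted
    # token id and a confidence score for every position in the grid.
    vocab_size = 16
    preds = rng.integers(0, vocab_size, size=tokens.shape)
    conf = rng.random(size=tokens.shape)
    return preds, conf

def masked_autoregressive_decode(num_tokens, action, steps=4, seed=0):
    """Iteratively unmask a frame's token grid, a few tokens per step,
    committing the highest-confidence predictions first."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(steps):
        n_remaining = int((tokens == MASK).sum())
        if n_remaining == 0:
            break
        preds, conf = toy_dynamics_model(tokens, action, rng)
        # Only still-masked positions are candidates for commitment.
        conf = np.where(tokens == MASK, conf, -np.inf)
        k = max(1, n_remaining // (steps - step))  # simple unmask schedule
        commit = np.argsort(conf)[-k:]
        tokens[commit] = preds[commit]
    return tokens

frame_tokens = masked_autoregressive_decode(num_tokens=16, action=np.zeros(7))
assert (frame_tokens != MASK).all()
```

Because several tokens are committed in parallel per step rather than one at a time, this style of decoding needs far fewer forward passes than strict left-to-right autoregression, which is consistent with the real-time speed claim above.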