Unified Video Action Model
February 28, 2025
Authors: Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song
cs.AI
Abstract
A unified video and action model holds significant promise for robotics,
where videos provide rich scene information for action prediction, and actions
provide dynamics information for video prediction. However, effectively
combining video generation and action prediction remains challenging, and
current video generation-based methods struggle to match the performance of
direct policy learning in action accuracy and inference speed. To bridge this
gap, we introduce the Unified Video Action model (UVA), which jointly optimizes
video and action predictions to achieve both high accuracy and efficient action
inference. The key lies in learning a joint video-action latent representation
and decoupling video-action decoding. The joint latent representation bridges
the visual and action domains, effectively modeling the relationship between
video and action sequences. Meanwhile, the decoupled decoding, powered by two
lightweight diffusion heads, enables high-speed action inference by bypassing
video generation during inference. Such a unified framework further enables
versatile functionality through masked input training. By selectively masking
actions or videos, a single model can tackle diverse tasks beyond policy
learning, such as forward and inverse dynamics modeling and video generation.
Via an extensive set of experiments, we demonstrate that UVA can serve as a
general-purpose solution for a wide range of robotics tasks, such as policy
learning, forward/inverse dynamics and video observation prediction, without
compromising performance compared to methods tailored for specific
applications. Results are best viewed on
https://unified-video-action-model.github.io/.
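The masked-input training described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: token shapes, the `MASK` value, and the `build_masked_input` helper are all assumptions made for illustration. The idea shown is only the input-side masking that lets one model switch between tasks.

```python
import numpy as np

def build_masked_input(video, actions, mask_video=False, mask_actions=False):
    """Selectively mask one modality before joint encoding.

    Hypothetical sketch of UVA-style masked-input training:
      mask_actions=True -> model must recover actions (policy learning /
                           inverse dynamics) or generate action-free video
      mask_video=True   -> model must predict video given actions
                           (forward dynamics)
    """
    MASK = -1.0  # stand-in for a learned mask token (assumption)
    v = np.full_like(video, MASK) if mask_video else video
    a = np.full_like(actions, MASK) if mask_actions else actions
    # Concatenate into a single joint video-action input sequence.
    return np.concatenate([v.ravel(), a.ravel()])

# Example: mask the action channel, as in policy learning.
video = np.ones((4, 8))    # 4 frames, 8-dim latent per frame (assumed sizes)
actions = np.ones((4, 2))  # 4 timesteps, 2-dim actions
x = build_masked_input(video, actions, mask_actions=True)
```

In this sketch the downstream network would be trained to reconstruct whichever modality was masked; the paper's decoupled diffusion heads would then decode video and actions separately from the shared latent.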