MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

February 17, 2025
Authors: Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
cs.AI

Abstract

World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models are mainly built on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained in their prediction duration and overall generalization capability. In this paper, we address this problem by combining the generation loss with MAE-style feature-level context learning. In particular, we instantiate this goal with three key designs: (1) We adopt a more scalable Diffusion Transformer (DiT) structure trained with an extra mask construction task. (2) We devise diffusion-related mask tokens to deal with the fuzzy relations between mask reconstruction and the generative diffusion process. (3) We extend the mask construction task to the spatial-temporal domain by using a row-wise mask for shifted self-attention rather than the masked self-attention of MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on the above improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model has two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, including standard validation on the nuScenes dataset, long-horizon rollout on the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state-of-the-art driving world models.
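The row-wise masking mentioned in design (3) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 25% mask ratio, the independent sampling of hidden rows per frame, and the placeholder `"[MASK]"` token are all assumptions made for illustration, and plain Python lists stand in for tensors. The sketch shows only one property that plausibly motivates the design: hiding whole rows of tokens leaves every frame as a smaller rectangular grid, which is the layout that window-based attention expects. The diffusion-related mask tokens of design (2) are not modelled here.

```python
import random


def row_wise_mask(num_frames, height, mask_ratio, rng):
    """Return mask[t][h] == True where token row h of frame t is hidden."""
    num_masked = round(height * mask_ratio)
    mask = []
    for _ in range(num_frames):
        hidden = set(rng.sample(range(height), num_masked))
        mask.append([h in hidden for h in range(height)])
    return mask


def drop_masked_rows(tokens, mask):
    """Keep only visible rows; each frame stays a rectangular grid."""
    return [
        [row for row, hidden in zip(frame, frame_mask) if not hidden]
        for frame, frame_mask in zip(tokens, mask)
    ]


def fill_mask_tokens(visible, mask, mask_token, width):
    """Rebuild the full grid, placing mask_token wherever a row was hidden."""
    full = []
    for frame_visible, frame_mask in zip(visible, mask):
        rows = iter(frame_visible)
        full.append([
            [mask_token] * width if hidden else next(rows)
            for hidden in frame_mask
        ])
    return full


# Toy video: 4 frames, each an 8 x 6 grid of tokens labelled (t, h, w).
rng = random.Random(0)
T, H, W = 4, 8, 6
tokens = [[[(t, h, w) for w in range(W)] for h in range(H)] for t in range(T)]

mask = row_wise_mask(T, H, 0.25, rng)             # hide 2 of 8 rows per frame
visible = drop_masked_rows(tokens, mask)          # what an encoder would see
full = fill_mask_tokens(visible, mask, "[MASK]", W)  # input for reconstruction
```

In this toy setup `visible` has 6 rows of 6 tokens per frame, while `full` restores the 8 x 6 layout with placeholder tokens in the hidden rows. A random per-token mask, as in the original MAE, would instead leave an irregular set of visible tokens in each frame.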
