
Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model

March 28, 2025
Authors: Jangho Park, Taesung Kwon, Jong Chul Ye
cs.AI

Abstract

Recently, multi-view and 4D video generation has emerged as a significant research topic. Nonetheless, existing approaches to 4D generation still struggle with fundamental limitations: they rely either on harnessing multiple video diffusion models with additional training, or on compute-intensive training of a full 4D diffusion model, which is constrained by the scarcity of real-world 4D data and large computational costs. To address these challenges, we propose the first training-free 4D video generation method that leverages off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) We designate the edge frames of the spatio-temporal sampling grid as key frames and synthesize them first with a video diffusion model, using a depth-based warping technique for guidance. This ensures structural consistency across the generated frames and preserves spatial and temporal coherence. (2) We then interpolate the remaining frames with the same video diffusion model, constructing a fully populated, temporally coherent sampling grid while preserving spatial and temporal consistency. Through this approach, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Our method is training-free and fully utilizes an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.
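To make the two-step procedure concrete, the sketch below fills a view-by-time sampling grid in two passes, mirroring the description in the abstract. It is a minimal illustration under stated assumptions, not the authors' implementation: the grid layout and the helpers warp_fn, synthesize_fn, and interpolate_fn are hypothetical placeholders standing in for the depth-based warping step and the off-the-shelf video diffusion model.

```python
import numpy as np

def generate_view_time_grid(input_video, camera_poses,
                            warp_fn, synthesize_fn, interpolate_fn):
    """Fill a V x T grid of frames (camera views x time steps) in two passes.

    input_video  : list of T frames; occupies the first view row of the grid.
    camera_poses : V camera poses along the novel trajectory (pose 0 = input view).
    warp_fn(frame, pose)     -> depth-warped guidance image        (assumed helper)
    synthesize_fn(guidance)  -> key frame from the diffusion model (assumed helper)
    interpolate_fn(a, b, n)  -> n in-between frames from the model (assumed helper)
    """
    T, V = len(input_video), len(camera_poses)
    grid = np.empty((V, T), dtype=object)

    # Row 0 of the sampling grid is the input video itself.
    for t in range(T):
        grid[0, t] = input_video[t]

    # Step 1: synthesize the remaining edge cells of the grid as key frames,
    # guided by depth-based warping of the corresponding input frames.
    for v in range(1, V):
        for t in range(T):
            if v == V - 1 or t in (0, T - 1):       # border of the V x T grid
                guidance = warp_fn(input_video[t], camera_poses[v])
                grid[v, t] = synthesize_fn(guidance)

    # Step 2: interpolate the interior cells along each intermediate view row,
    # fully populating the grid while keeping it temporally coherent.
    for v in range(1, V - 1):
        inbetween = interpolate_fn(grid[v, 0], grid[v, T - 1], T - 2)
        for t in range(1, T - 1):
            grid[v, t] = inbetween[t - 1]

    return grid
```

In this reading, each row of the grid is one camera pose over time and each column is one instant across poses; the paper's actual key-frame selection and interpolation order may differ from this simplified pass structure.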
