Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
March 26, 2025
Authors: Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
cs.AI
Abstract
We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models to obtain a consistent 4D scene representation, which offers notable advantages in efficiency and generalizability. 1) To this end, we first animate the input image using an image-to-video diffusion model, followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatially and temporally consistent multi-view videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into a consistent 4D representation, we propose a modulation-based refinement that mitigates inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.
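
The guided denoising of step 2 is only named at a high level in the abstract, so below is a minimal, runnable NumPy sketch of one plausible reading: a RePaint-style loop in which the latent of a coarse point-cloud render is re-noised and blended in wherever the render is reliable (point-guided spatial consistency), while the first frame is pinned to a shared anchor video's latent (latent replacement for temporal coherence). All names (`render_latent`, `render_mask`, `anchor_latent`), the stubbed `denoiser`, and the toy noise schedule are illustrative assumptions, not Free4D's actual implementation.

```python
import numpy as np

# Toy dimensions: (frames, channels, height, width) of a video latent.
F, C, H, W = 8, 4, 16, 16
T = 50                                   # denoising steps
rng = np.random.default_rng(0)
# Toy schedule: alpha_bar[t] grows from ~0 (pure noise) to ~1 (clean),
# so the loop below walks from the noisiest step to the cleanest.
alpha_bar = np.linspace(1e-3, 0.999, T)

def denoiser(z_t, t):
    # Stub for the pre-trained video diffusion model's noise prediction;
    # replaced with random noise here so the loop runs end to end.
    return rng.standard_normal(z_t.shape).astype(np.float32)

def forward_noise(x0, t):
    # q(z_t | x0): re-noise a clean latent to match timestep t.
    eps = rng.standard_normal(x0.shape).astype(np.float32)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# Assumed inputs: latents of the coarse point-cloud renders for the target
# view, a mask of the pixels those renders actually cover, and the latent
# of an already-generated anchor video shared across views.
render_latent = rng.standard_normal((F, C, H, W)).astype(np.float32)
render_mask = (rng.random((F, 1, H, W)) > 0.3).astype(np.float32)
anchor_latent = rng.standard_normal((F, C, H, W)).astype(np.float32)

z = rng.standard_normal((F, C, H, W)).astype(np.float32)  # start from noise
for t in range(T):
    # Point-guided denoising: where the coarse 4D structure is visible,
    # overwrite z_t with the re-noised render latent so every view is
    # denoised toward the same underlying geometry.
    z = render_mask * forward_noise(render_latent, t) + (1 - render_mask) * z

    # Latent replacement: pin the first frame to the anchor video's first
    # frame (re-noised to timestep t) so all views share one timeline.
    z[0] = forward_noise(anchor_latent[0], t)

    # One DDIM-style update from the model's noise estimate.
    eps = denoiser(z, t)
    x0_hat = (z - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    if t + 1 < T:
        z = np.sqrt(alpha_bar[t + 1]) * x0_hat + np.sqrt(1 - alpha_bar[t + 1]) * eps
    else:
        z = x0_hat
```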