Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
March 26, 2025
Authors: Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
cs.AI
Abstract
We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models to obtain a consistent 4D scene representation, which offers notable advantages in efficiency and generalizability. 1) To this end, we first animate the input image using an image-to-video diffusion model, followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatially and temporally consistent multi-view videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into a consistent 4D representation, we propose a modulation-based refinement that mitigates inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.
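
The guided denoising of step 2 is only named at a high level in the abstract, so below is a minimal, runnable NumPy sketch of one plausible reading: a RePaint-style loop in which the latent of a coarse point-cloud render is re-noised and blended in wherever the render is reliable (point-guided spatial consistency), while the first frame is pinned to a shared anchor video's latent (latent replacement for temporal coherence). All names (`render_latent`, `render_mask`, `anchor_latent`), the stubbed `denoiser`, and the toy noise schedule are illustrative assumptions, not Free4D's actual implementation.

```python
import numpy as np

# Toy dimensions: (frames, channels, height, width) of a video latent.
F, C, H, W = 8, 4, 16, 16
T = 50                                   # denoising steps
rng = np.random.default_rng(0)
# Toy schedule: alpha_bar[t] grows from ~0 (pure noise) to ~1 (clean),
# so the loop below walks from the noisiest step to the cleanest.
alpha_bar = np.linspace(1e-3, 0.999, T)

def denoiser(z_t, t):
    # Stub for the pre-trained video diffusion model's noise prediction;
    # replaced with random noise here so the loop runs end to end.
    return rng.standard_normal(z_t.shape).astype(np.float32)

def forward_noise(x0, t):
    # q(z_t | x0): re-noise a clean latent to match timestep t.
    eps = rng.standard_normal(x0.shape).astype(np.float32)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# Assumed inputs: latents of the coarse point-cloud renders for the target
# view, a mask of the pixels those renders actually cover, and the latent
# of an already-generated anchor video shared across views.
render_latent = rng.standard_normal((F, C, H, W)).astype(np.float32)
render_mask = (rng.random((F, 1, H, W)) > 0.3).astype(np.float32)
anchor_latent = rng.standard_normal((F, C, H, W)).astype(np.float32)

z = rng.standard_normal((F, C, H, W)).astype(np.float32)  # start from noise
for t in range(T):
    # Point-guided denoising: where the coarse 4D structure is visible,
    # overwrite z_t with the re-noised render latent so every view is
    # denoised toward the same underlying geometry.
    z = render_mask * forward_noise(render_latent, t) + (1 - render_mask) * z

    # Latent replacement: pin the first frame to the anchor video's first
    # frame (re-noised to timestep t) so all views share one timeline.
    z[0] = forward_noise(anchor_latent[0], t)

    # One DDIM-style update from the model's noise estimate.
    eps = denoiser(z, t)
    x0_hat = (z - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    if t + 1 < T:
        z = np.sqrt(alpha_bar[t + 1]) * x0_hat + np.sqrt(1 - alpha_bar[t + 1]) * eps
    else:
        z = x0_hat
```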