4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

December 5, 2024
Authors: Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee
cs.AI

Abstract

We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes. In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. We propose a novel two-stream architecture. One stream performs viewpoint updates on columns, and the other stream performs temporal updates on rows. After each diffusion transformer layer, a synchronization layer exchanges information between the two token streams. We propose two implementations of the synchronization layer, using either hard or soft synchronization. This feedforward architecture improves upon previous work in three ways: higher inference speed, enhanced visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore and Dust3R-Confidence).
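
The dual-stream update described above lends itself to a short sketch. Below is a minimal, hypothetical PyTorch rendering of the idea: one transformer stream attends along rows of the grid (same viewpoint, varying time), the other along columns (same timestep, varying viewpoint), and a synchronization step exchanges information between the two token streams after each layer. The class name `DualStreamBlock`, the `(B, T, V, N, D)` token layout, and the exact hard/soft synchronization rules are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DualStreamBlock(nn.Module):
    """One dual-stream layer over a (time, view) grid of frame tokens.

    Tokens have shape (B, T, V, N, D): batch, timesteps, viewpoints,
    tokens per frame, channels. Module names and the soft-sync rule
    are illustrative assumptions, not the paper's implementation.
    """

    def __init__(self, dim: int, heads: int = 8, soft: bool = True):
        super().__init__()
        # One standard transformer layer per stream (attention + MLP).
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.view = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.soft = soft
        if soft:
            # Soft sync: learned blend of the two streams' tokens.
            self.mix = nn.Linear(2 * dim, dim)

    def forward(self, x_t: torch.Tensor, x_v: torch.Tensor):
        B, T, V, N, D = x_t.shape
        # Temporal stream: attend along rows (same viewpoint, all timesteps).
        h_t = self.temporal(
            x_t.permute(0, 2, 1, 3, 4).reshape(B * V, T * N, D)
        ).reshape(B, V, T, N, D).permute(0, 2, 1, 3, 4)
        # Viewpoint stream: attend along columns (same timestep, all views).
        h_v = self.view(x_v.reshape(B * T, V * N, D)).reshape(B, T, V, N, D)
        if self.soft:
            # Soft synchronization: each stream keeps its own tokens and
            # receives a learned residual carrying the other stream's state.
            fused = self.mix(torch.cat([h_t, h_v], dim=-1))
            return h_t + fused, h_v + fused
        # Hard synchronization: both streams continue from one shared state.
        shared = 0.5 * (h_t + h_v)
        return shared, shared


# Toy usage: 2 videos, 4 timesteps x 3 views, 16 tokens per frame, dim 64.
x = torch.randn(2, 4, 3, 16, 64)
block = DualStreamBlock(dim=64)
x_t, x_v = block(x, x.clone())
print(x_t.shape)  # torch.Size([2, 4, 3, 16, 64])
```

In this reading, the hard variant collapses both streams onto a single shared state, enforcing exact agreement between temporal and viewpoint tokens, while the soft variant lets each stream keep its own tokens and passes a learned residual between them, trading strict consistency for flexibility. The paper's actual synchronization layers may differ in detail.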
