Pippo: High-Resolution Multi-View Humans from a Single Image
February 11, 2025
Authors: Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, Timur Bagautdinov
cs.AI
Abstract
We present Pippo, a generative model capable of producing 1K resolution dense
turnaround videos of a person from a single casually clicked photo. Pippo is a
multi-view diffusion transformer and does not require any additional inputs -
e.g., a fitted parametric model or camera parameters of the input image. We
pre-train Pippo on 3B human images without captions, and conduct multi-view
mid-training and post-training on studio captured humans. During mid-training,
to quickly absorb the studio dataset, we denoise several (up to 48) views at
low-resolution, and encode target cameras coarsely using a shallow MLP. During
post-training, we denoise fewer views at high-resolution and use pixel-aligned
controls (e.g., spatial anchors and Plücker rays) to enable 3D consistent
generations. At inference, we propose an attention biasing technique that
allows Pippo to simultaneously generate greater than 5 times as many views as
seen during training. Finally, we also introduce an improved metric to evaluate
3D consistency of multi-view generations, and show that Pippo outperforms
existing works on multi-view human generation from a single image.
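The abstract mentions Plücker rays as one of the pixel-aligned camera controls used during post-training. As a minimal sketch of what such an encoding looks like (the exact conditioning used by Pippo may differ; the function name and conventions below are illustrative assumptions), each pixel's viewing ray can be represented by its normalized world-space direction d together with the moment m = o × d, where o is the camera center:

```python
import numpy as np

def plucker_rays(K, R, t, H, W):
    """Per-pixel Plücker ray embedding (direction d, moment m = o x d).

    Assumes the OpenCV-style convention x_cam = R @ x_world + t,
    with intrinsics K mapping camera coordinates to pixels.
    Returns an (H, W, 6) array: [d | m] per pixel.
    """
    # Camera center in world coordinates: o = -R^T t
    o = -R.T @ t
    # Pixel-center grid (u, v)
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project pixels to world-space ray directions:
    # d_world = R^T K^{-1} p  ->  row form: p^T K^{-T} R
    d = pix @ np.linalg.inv(K).T @ R
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Moment: constant camera center crossed with each direction
    m = np.cross(np.broadcast_to(o, d.shape), d)
    return np.concatenate([d, m], axis=-1)  # (H, W, 6)
```

Because m = o × d is always orthogonal to d, the resulting 6-vector is a valid Plücker coordinate for each pixel's ray, and it can be concatenated channel-wise with image latents as a dense, pixel-aligned camera conditioning signal.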