
Pippo: High-Resolution Multi-View Humans from a Single Image

February 11, 2025
Authors: Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, Timur Bagautdinov
cs.AI

Abstract

We present Pippo, a generative model capable of producing 1K-resolution dense turnaround videos of a person from a single casually taken photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs, such as a fitted parametric model or the camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio-captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low resolution and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high resolution and use pixel-aligned controls (e.g., spatial anchor and Plücker rays) to enable 3D-consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate more than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate the 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.
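The abstract names Plücker rays as one of the pixel-aligned camera controls but does not say how they are computed. As a rough illustration only, and not the authors' implementation, the sketch below builds a per-pixel Plücker ray map with NumPy using the standard definition (unit direction d, moment o × d). The function name `plucker_rays`, the pinhole intrinsics `K`, the camera-to-world pose convention, and the half-pixel offset are all assumptions made for this example.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d), one ray per pixel.

    Hypothetical helper: assumes a pinhole camera with 3x3 intrinsics K and a
    4x4 camera-to-world pose; the paper's actual conventions may differ.
    """
    # Pixel centers, using a half-pixel offset (an assumed convention).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project pixels to camera-space directions via the inverse intrinsics.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate directions into world space and normalize to unit length.
    rotation = cam_to_world[:3, :3]
    origin = cam_to_world[:3, 3]
    dirs_world = dirs_cam @ rotation.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # The moment o x d makes the encoding independent of where along the ray o lies.
    origins = np.broadcast_to(origin, dirs_world.shape)
    moment = np.cross(origins, dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)

# Example: a 64x64 camera placed two units behind the world origin.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.0, 0.0, -2.0]
rays = plucker_rays(K, pose, 64, 64)
print(rays.shape)  # (64, 64, 6)
```

Because the map has the same spatial layout as the image, it can be resized and concatenated with image features, which is one common reading of "pixel-aligned" conditioning.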

