Pippo: High-Resolution Multi-View Humans from a Single Image
February 11, 2025
Authors: Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, Timur Bagautdinov
cs.AI
Abstract
We present Pippo, a generative model capable of producing 1K resolution dense
turnaround videos of a person from a single casually clicked photo. Pippo is a
multi-view diffusion transformer and does not require any additional inputs -
e.g., a fitted parametric model or camera parameters of the input image. We
pre-train Pippo on 3B human images without captions, and conduct multi-view
mid-training and post-training on studio captured humans. During mid-training,
to quickly absorb the studio dataset, we denoise several (up to 48) views at
low-resolution, and encode target cameras coarsely using a shallow MLP. During
post-training, we denoise fewer views at high-resolution and use pixel-aligned
controls (e.g., spatial anchors and Plücker rays) to enable 3D consistent
generations. At inference, we propose an attention biasing technique that
allows Pippo to simultaneously generate greater than 5 times as many views as
seen during training. Finally, we also introduce an improved metric to evaluate
3D consistency of multi-view generations, and show that Pippo outperforms
existing works on multi-view human generation from a single image.
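The abstract mentions Plücker rays as one of the pixel-aligned camera controls used during post-training. As a minimal sketch of what such an encoding looks like (the exact conditioning used by Pippo may differ; the function name and conventions below are illustrative assumptions), each pixel's viewing ray can be represented by its normalized world-space direction d together with the moment m = o × d, where o is the camera center:

```python
import numpy as np

def plucker_rays(K, R, t, H, W):
    """Per-pixel Plücker ray embedding (direction d, moment m = o x d).

    Assumes the OpenCV-style convention x_cam = R @ x_world + t,
    with intrinsics K mapping camera coordinates to pixels.
    Returns an (H, W, 6) array: [d | m] per pixel.
    """
    # Camera center in world coordinates: o = -R^T t
    o = -R.T @ t
    # Pixel-center grid (u, v)
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project pixels to world-space ray directions:
    # d_world = R^T K^{-1} p  ->  row form: p^T K^{-T} R
    d = pix @ np.linalg.inv(K).T @ R
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Moment: constant camera center crossed with each direction
    m = np.cross(np.broadcast_to(o, d.shape), d)
    return np.concatenate([d, m], axis=-1)  # (H, W, 6)
```

Because m = o × d is always orthogonal to d, the resulting 6-vector is a valid Plücker coordinate for each pixel's ray, and it can be concatenated channel-wise with image latents as a dense, pixel-aligned camera conditioning signal.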