
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

December 9, 2024
Authors: Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang
cs.AI

Abstract

Recent 3D generation models typically rely on limited-scale 3D "gold labels" or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data: You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual condition: a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: https://vision.baai.ac.cn/see3d
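
To make the pose-free conditioning concrete, below is a minimal sketch of how one might corrupt masked video frames with time-dependent noise to form such a 2D visual condition. This is an illustrative assumption in PyTorch, not the authors' released code: the function name, the simple linear noise schedule, and the tensor shapes are all hypothetical.

```python
import torch

def make_visual_condition(frames, mask, t, num_steps=1000):
    """Sketch of a 2D-inductive visual condition (hypothetical, not the paper's code).

    frames: (B, T, C, H, W) video clip, values in [-1, 1]
    mask:   (B, T, 1, H, W) binary mask; 1 = pixels kept as guidance
    t:      (B,) integer diffusion timestep per sample
    Returns a noisy, masked video used as the conditioning signal.
    """
    # Hypothetical linear schedule: alpha shrinks as t grows, so later
    # timesteps inject more noise (the paper's exact schedule may differ).
    alpha = 1.0 - t.float() / num_steps          # (B,)
    alpha = alpha.view(-1, 1, 1, 1, 1)           # broadcast over T, C, H, W
    noise = torch.randn_like(frames)
    # Time-dependent corruption of the frames.
    noisy = alpha.sqrt() * frames + (1.0 - alpha).sqrt() * noise
    # Keep only the masked (observed) content; the condition carries
    # appearance cues but no explicit camera-pose information.
    return mask * noisy

# Usage example: condition on the first frame of each 8-frame clip.
frames = torch.randn(2, 8, 3, 64, 64)
mask = torch.zeros(2, 8, 1, 64, 64)
mask[:, 0] = 1.0
t = torch.randint(0, 1000, (2,))
cond = make_visual_condition(frames, mask, t)    # (2, 8, 3, 64, 64)
```

The point of the design, as the abstract describes it, is that the corrupted signal is purely 2D-inductive: it guides the multi-view diffusion model with visual content alone, removing the need for camera-pose annotations on web-scale video.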

