You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
December 9, 2024
Authors: Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang
cs.AI
Abstract
Recent 3D generation models typically rely on limited-scale 3D "gold labels"
or 2D diffusion priors for 3D content creation. However, their performance is
upper-bounded by constrained 3D priors due to the lack of scalable learning
paradigms. In this work, we present See3D, a visual-conditional multi-view
diffusion model trained on large-scale Internet videos for open-world 3D
creation. The model aims to Get 3D knowledge by solely Seeing the visual
content from the vast and rapidly growing video data -- You See it, You Got
it. To achieve this, we first scale up the training data using a proposed data
curation pipeline that automatically filters out multi-view inconsistencies
and insufficient observations from source videos (a toy heuristic is sketched
after the abstract). This results in a high-quality,
richly diverse, large-scale dataset of multi-view images, termed WebVi3D,
containing 320M frames from 16M video clips. Nevertheless, learning generic 3D
priors from videos without explicit 3D geometry or camera pose annotations is
nontrivial, and annotating poses for web-scale videos is prohibitively
expensive. To eliminate the need for pose conditions, we introduce an
innovative visual condition: a purely 2D-inductive visual signal generated by
adding time-dependent noise to the masked video data (a minimal sketch follows
the abstract). Finally, we integrate See3D into a warping-based pipeline,
yielding a novel visual-conditional framework for high-fidelity 3D generation.
Our numerical and
visual comparisons on single and sparse reconstruction benchmarks show that
See3D, trained on cost-effective and scalable video data, achieves notable
zero-shot and open-world generation capabilities, markedly outperforming models
trained on costly and constrained 3D datasets. Please refer to our project page
at: https://vision.baai.ac.cn/see3d
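
The curation criteria above (multi-view inconsistency and insufficient observation) are only named in the abstract, not specified. The toy heuristic below is one illustrative reading, assuming a clip is scored by its mean inter-frame pixel change; the function `keep_clip` and both thresholds are hypothetical, not taken from the paper.

```python
import numpy as np

def keep_clip(frames: np.ndarray,
              min_motion: float = 2.0,
              max_motion: float = 30.0) -> bool:
    """Hypothetical WebVi3D-style filter (illustrative thresholds).

    frames: (T, H, W, 3) uint8 clip. Rejects clips whose mean
    inter-frame change is too small (camera barely moves, so the
    scene is insufficiently observed) or too large (likely dynamic
    content, which breaks multi-view consistency).
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    motion = float(diffs.mean())  # average per-pixel change between frames
    return min_motion <= motion <= max_motion

# A clip of pure noise changes wildly between frames and is rejected.
clip = np.random.randint(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)
print(keep_clip(clip))  # False: mean change far exceeds max_motion
```

A real pipeline would more plausibly rely on optical flow or geometric consistency checks than raw pixel differences, but the accept/reject structure would be similar.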
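The visual condition, time-dependent noise added to masked video data, can be sketched in the same hedged spirit. The snippet below assumes a simple linear blend between clean frames and Gaussian noise with a per-frame (time-dependent) scale; `make_visual_condition`, the mask convention, and the noise schedule are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def make_visual_condition(frames: torch.Tensor,
                          mask: torch.Tensor,
                          noise_scales: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of a pose-free visual condition (assumed design).

    frames:       (T, C, H, W) clip with values in [0, 1]
    mask:         (T, 1, H, W) binary mask; 1 keeps a pixel visible
    noise_scales: (T,) per-frame noise level, i.e. time-dependent
    """
    noise = torch.randn_like(frames)
    s = noise_scales.view(-1, 1, 1, 1)
    # Each frame is corrupted by its own noise level, so later frames
    # carry progressively less pixel-exact information.
    noisy = (1.0 - s) * frames + s * noise
    # Visible (masked-in) pixels stay clean; the rest provide only a
    # coarse 2D-inductive signal instead of exact pixels or poses.
    return mask * frames + (1.0 - mask) * noisy

# Example: an 8-frame clip with noise ramping from 0.1 to 0.9.
frames = torch.rand(8, 3, 64, 64)
mask = (torch.rand(8, 1, 64, 64) > 0.5).float()
cond = make_visual_condition(frames, mask, torch.linspace(0.1, 0.9, 8))
```

Because the condition is derived purely from pixels, no camera pose annotation enters the pipeline, which is the point the abstract emphasizes.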