MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
October 4, 2024
Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, Ming-Hsuan Yang
cs.AI
Abstract
Estimating geometry from dynamic scenes, where objects move and deform over
time, remains a core challenge in computer vision. Current approaches often
rely on multi-stage pipelines or global optimizations that decompose the
problem into subtasks, like depth and flow, leading to complex systems prone to
errors. In this paper, we present Motion DUSt3R (MonST3R), a novel
geometry-first approach that directly estimates per-timestep geometry from
dynamic scenes. Our key insight is that by simply estimating a pointmap for
each timestep, we can effectively adapt DUSt3R's representation, previously
only used for static scenes, to dynamic scenes. However, this approach presents
a significant challenge: the scarcity of suitable training data, namely
dynamic, posed videos with depth labels. Despite this, we show that by posing
the problem as a fine-tuning task, identifying several suitable datasets, and
strategically training the model on this limited data, we can surprisingly
enable the model to handle dynamics, even without an explicit motion
representation. Based on this, we introduce new optimizations for several
downstream video-specific tasks and demonstrate strong performance on video
depth and camera pose estimation, outperforming prior work in terms of
robustness and efficiency. Moreover, MonST3R shows promising results for
primarily feed-forward 4D reconstruction.
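To make the abstract's key insight concrete, the sketch below shows how a DUSt3R-style two-view network could be applied per timestep to a dynamic video to produce one pointmap per moment in time. The `estimate_video_pointmaps` helper and the model call signature are hypothetical illustrations under assumed interfaces, not the released MonST3R API.

```python
# Minimal sketch: per-timestep pointmap estimation for a dynamic video.
# The model interface below is an assumption for illustration, not the
# actual MonST3R/DUSt3R API.
import torch

def estimate_video_pointmaps(model, frames):
    """For each neighboring frame pair (t, t+1), regress a pointmap:
    an H x W x 3 grid giving a 3D point for every pixel, expressed in
    the first frame's camera coordinates. One map per timestep encodes
    the scene geometry at that moment, including moving objects."""
    pointmaps = []
    for t in range(len(frames) - 1):
        with torch.no_grad():
            # A DUSt3R-style network takes two RGB frames and predicts
            # pixel-aligned 3D points for both, plus confidence maps.
            pts_t, pts_t1, conf = model(frames[t], frames[t + 1])
        pointmaps.append((pts_t, pts_t1, conf))
    return pointmaps

# Downstream quantities follow directly from the pointmaps: per-frame
# video depth is the z-coordinate of each map, and camera poses can be
# recovered by rigidly aligning the static parts of successive maps.
```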