Video Depth without Video Models

November 28, 2024
Authors: Bingxin Ke, Dominik Narnhofer, Shengyu Huang, Lei Ke, Torben Peters, Katerina Fragkiadaki, Anton Obukhov, Konrad Schindler
cs.AI

Abstract

Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets; and (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various frame rates back into a consistent video. RollingDepth is able to efficiently handle long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: rollingdepth.github.io.
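
To make the second ingredient more concrete, here is a minimal sketch of how depth snippets sampled at several frame rates could be registered into one video. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the objective or the optimizer, so the sketch assumes each snippet differs from a common depth video by an unknown per-snippet scale and shift, and it swaps in plain linear least squares where the paper describes a robust optimization. The function names (`sample_snippets`, `align_snippets`, `merge_snippets`) and the `dilations` parameter are hypothetical.

```python
import numpy as np


def sample_snippets(num_frames, dilations=(1, 2), snippet_len=3):
    """Return frame-index tuples for overlapping snippets at each dilation."""
    snippets = []
    for d in dilations:
        span = (snippet_len - 1) * d  # distance from first to last frame
        for start in range(num_frames - span):
            snippets.append(tuple(start + i * d for i in range(snippet_len)))
    return snippets


def align_snippets(snippets, depths):
    """Least-squares scale and shift per snippet so that shared frames agree.

    depths[k] has shape (snippet_len, H, W). Snippet 0 is the reference
    (scale 1, shift 0), which removes the global scale/shift ambiguity.
    Assumption: an affine model per snippet; not taken from the paper.
    """
    n = len(snippets)
    # For every frame, list the (snippet index, position in snippet) pairs.
    by_frame = {}
    for k, frames in enumerate(snippets):
        for pos, f in enumerate(frames):
            by_frame.setdefault(f, []).append((k, pos))

    rows = []
    for members in by_frame.values():
        # Chain consecutive snippets that contain the same frame.
        for (a, pa), (b, pb) in zip(members, members[1:]):
            da = depths[a][pa].ravel()
            db = depths[b][pb].ravel()
            block = np.zeros((da.size, 2 * n))
            block[:, 2 * a], block[:, 2 * a + 1] = da, 1.0
            block[:, 2 * b], block[:, 2 * b + 1] = -db, -1.0
            rows.append(block)  # residual: s_a*d_a + t_a - s_b*d_b - t_b
    A = np.vstack(rows)

    # Fix snippet 0 at scale 1, shift 0: move its scale column to the RHS.
    rhs = -A[:, 0]
    sol, *_ = np.linalg.lstsq(A[:, 2:], rhs, rcond=None)
    params = np.concatenate([[1.0, 0.0], sol]).reshape(n, 2)
    return params[:, 0], params[:, 1]


def merge_snippets(snippets, depths, scales, shifts, num_frames):
    """Average all aligned snippet predictions that cover each frame."""
    height, width = depths[0].shape[1:]
    acc = np.zeros((num_frames, height, width))
    count = np.zeros(num_frames)
    for k, frames in enumerate(snippets):
        for pos, f in enumerate(frames):
            acc[f] += scales[k] * depths[k][pos] + shifts[k]
            count[f] += 1
    return acc / count[:, None, None]


# Synthetic check: distort a known depth video per snippet, then recover it.
rng = np.random.default_rng(0)
video = rng.uniform(1.0, 10.0, size=(5, 4, 4))
snippets = sample_snippets(5)
true_scale = np.concatenate([[1.0], rng.uniform(0.5, 2.0, len(snippets) - 1)])
true_shift = np.concatenate([[0.0], rng.uniform(-1.0, 1.0, len(snippets) - 1)])
depths = [(video[list(fr)] - true_shift[k]) / true_scale[k]
          for k, fr in enumerate(snippets)]
scales, shifts = align_snippets(snippets, depths)
merged = merge_snippets(snippets, depths, scales, shifts, 5)
print(np.allclose(merged, video))
```

Snippets with a larger dilation tie together frames that are farther apart, which is what lets the alignment propagate a consistent scale and shift along a long video. With noisy predictions, a robust loss would be the natural replacement for the squared error used here.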
