

Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

January 21, 2025
Authors: Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, Bingyi Kang
cs.AI

Abstract

Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
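The temporal consistency loss is described here only as "constraining the temporal depth gradient." As a rough illustration of that idea, and not the paper's exact formulation, the following PyTorch sketch penalizes the mismatch between predicted and ground-truth frame-to-frame depth changes; the tensor layout and the masking scheme are assumptions.

```python
import torch

def temporal_gradient_loss(pred: torch.Tensor,
                           gt: torch.Tensor,
                           valid: torch.Tensor | None = None) -> torch.Tensor:
    """Hedged sketch of a temporal-gradient consistency loss.

    pred, gt: (B, T, H, W) depth sequences (layout is an assumption).
    valid:    optional (B, T, H, W) boolean mask of annotated pixels.
    """
    # Temporal depth gradient: change in depth between consecutive frames.
    d_pred = pred[:, 1:] - pred[:, :-1]   # (B, T-1, H, W)
    d_gt = gt[:, 1:] - gt[:, :-1]

    diff = (d_pred - d_gt).abs()
    if valid is not None:
        # A pixel contributes only if it is valid in both adjacent frames.
        m = (valid[:, 1:] & valid[:, :-1]).float()
        return (diff * m).sum() / m.sum().clamp(min=1.0)
    return diff.mean()
```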
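Likewise, the "key-frame-based strategy" for long-video inference is only named in the abstract. Below is a minimal sketch of the general pattern such a strategy rests on, assuming overlapped windowed inference with least-squares scale-shift alignment on the shared frames; `run_model`, the window length, and the overlap are all hypothetical, not the paper's exact procedure.

```python
import numpy as np

def align_scale_shift(src: np.ndarray, ref: np.ndarray):
    """Least-squares scale s and shift t such that s * src + t ≈ ref."""
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    s, t = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
    return s, t

def stitch_segments(run_model, frames, win=32, overlap=8):
    """Infer depth over overlapping windows and stitch them into one
    sequence. run_model(frames) -> list of (H, W) depth maps."""
    depths = list(run_model(frames[:win]))        # first window as-is
    start = win - overlap
    while start < len(frames):
        seg = list(run_model(frames[start:start + win]))
        if len(seg) <= overlap:                   # tail already covered
            break
        # Align the new window to the previous one on the shared frames.
        s, t = align_scale_shift(np.stack(seg[:overlap]),
                                 np.stack(depths[-overlap:]))
        depths.extend(s * d + t for d in seg[overlap:])
        start += win - overlap
    return depths
```

The paper's actual method additionally selects key frames from earlier context to condition each window; this sketch captures only the stitching side of the problem.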

