
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

March 31, 2025
Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
cs.AI

Abstract

Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or fine-tuned on extensive dynamic datasets. Our code is publicly available for research purposes at https://easi3r.github.io/.
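
The abstract does not specify how the decoder attention maps are aggregated or thresholded to separate dynamic regions from static ones; that procedure is described in the paper itself. The snippet below is only a minimal, hypothetical PyTorch sketch of the general idea of turning per-layer, per-head attention statistics into a dynamic-region mask. The function name `dynamic_mask_from_attention`, the tensor shapes, the mean aggregation, and the threshold `tau` are illustrative assumptions, not Easi3R's actual procedure.

```python
import torch

def dynamic_mask_from_attention(cross_attn: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch: flag image tokens whose attention deviates from the
    static-scene consensus as 'dynamic'.

    cross_attn: tensor of shape (L, H, N) holding, for each of L decoder layers
    and H heads, one attention value per spatial token (N tokens total).
    Returns a boolean mask of shape (N,).
    """
    # Aggregate over layers and heads to get one saliency value per token.
    saliency = cross_attn.mean(dim=(0, 1))                                   # (N,)
    # Normalize to [0, 1] so the threshold tau is scale-independent.
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    # Tokens receiving weak, inconsistent attention are treated as dynamic.
    return saliency < tau

# Toy usage with random values standing in for real DUSt3R attention outputs.
attn = torch.rand(12, 16, 1024)           # 12 layers, 16 heads, 32x32 tokens
mask = dynamic_mask_from_attention(attn)  # (1024,) boolean dynamic-region mask
print(mask.float().mean())                # fraction of tokens flagged as dynamic
```

Such a mask could, in principle, be used to exclude dynamic tokens when solving for camera pose and to restrict 4D point-map reconstruction to the segmented moving regions, which matches the pipeline the abstract outlines at a high level.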
