AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
April 17, 2025
Authors: Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, Shubham Tulsiani
cs.AI
Abstract
We explore the task of geometric reconstruction of images captured from a
mixture of ground and aerial views. Current state-of-the-art learning-based
approaches fail to handle the extreme viewpoint variation between aerial-ground
image pairs. Our hypothesis is that the lack of high-quality, co-registered
aerial-ground datasets for training is a key reason for this failure. Such data
is difficult to assemble precisely because it is difficult to reconstruct in a
scalable way. To overcome this challenge, we propose a scalable framework
combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google
Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The
pseudo-synthetic data simulates a wide range of aerial viewpoints, while the
real, crowd-sourced images help improve visual fidelity for ground-level images
where mesh-based renderings lack sufficient detail, effectively bridging the
domain gap between real images and pseudo-synthetic renderings. Using this
hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve
significant improvements on real-world, zero-shot aerial-ground tasks. For
example, we observe that baseline DUSt3R localizes fewer than 5% of
aerial-ground pairs within 5 degrees of camera rotation error, while
fine-tuning with our data raises accuracy to nearly 56%, addressing a major
failure point in handling large viewpoint changes. Beyond camera estimation and
scene reconstruction, our dataset also improves performance on downstream tasks
like novel-view synthesis in challenging aerial-ground scenarios, demonstrating
the practical value of our approach in real-world applications.
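
For reference, the "within 5 degrees of camera rotation error" criterion quoted above corresponds to a standard relative rotation accuracy metric (often written RRA@5°). The sketch below shows one common way to compute it from predicted and ground-truth rotation matrices; the function names are illustrative, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic distance (in degrees) between two 3x3 rotation matrices."""
    # For the relative rotation R = R_pred^T @ R_gt, trace(R) = 1 + 2*cos(theta).
    cos_theta = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    cos_theta = np.clip(cos_theta, -1.0, 1.0)  # guard against numerical drift
    return float(np.degrees(np.arccos(cos_theta)))

def rra_at_threshold(pred_rotations, gt_rotations, threshold_deg=5.0) -> float:
    """Fraction of image pairs whose rotation error falls below the threshold."""
    errors = [rotation_error_deg(Rp, Rg)
              for Rp, Rg in zip(pred_rotations, gt_rotations)]
    return float(np.mean([e < threshold_deg for e in errors]))
```

Under this metric, the reported jump from under 5% to nearly 56% of aerial-ground pairs within 5° means `rra_at_threshold` would rise from below 0.05 for the DUSt3R baseline to roughly 0.56 after fine-tuning on the hybrid dataset.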