ChatPaper.ai

Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

November 29, 2024
Authors: Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas
cs.AI

Abstract

Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships is still unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. All code and resources will be made publicly available to support further advancements in 3D-aware vision models. Our code is available at https://github.com/qq456cvb/3DCorrEnhance.
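The abstract's two technical ideas — measuring how consistent a model's dense features are at pixels that correspond to the same 3D point in two views, and finetuning with a correspondence-based objective — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the bilinear `sample_features` helper, the InfoNCE-style loss, and the temperature `tau` are my own choices for the sketch; the actual training recipe is in the linked repository.

```python
import torch
import torch.nn.functional as F

def sample_features(feat, uv):
    """Bilinearly sample a dense (C, H, W) feature map at (N, 2) pixel
    coordinates given as (x, y). Returns an (N, C) feature matrix."""
    C, H, W = feat.shape
    grid = uv.clone().float()
    grid[:, 0] = grid[:, 0] / (W - 1) * 2 - 1  # x -> [-1, 1]
    grid[:, 1] = grid[:, 1] / (H - 1) * 2 - 1  # y -> [-1, 1]
    grid = grid.view(1, 1, -1, 2)
    out = F.grid_sample(feat.unsqueeze(0), grid, align_corners=True)
    return out.view(C, -1).t()

def correspondence_similarity(feat_a, feat_b, uv_a, uv_b):
    """Mean cosine similarity of features at corresponding pixels across
    two views: a proxy for 3D equivariance (1.0 = perfectly consistent)."""
    fa = F.normalize(sample_features(feat_a, uv_a), dim=-1)
    fb = F.normalize(sample_features(feat_b, uv_b), dim=-1)
    return (fa * fb).sum(-1).mean()

def correspondence_nce_loss(feat_a, feat_b, uv_a, uv_b, tau=0.07):
    """InfoNCE-style finetuning objective (an assumption for this sketch):
    each feature sampled in view A should match its ground-truth
    correspondent in view B rather than any of the other samples."""
    fa = F.normalize(sample_features(feat_a, uv_a), dim=-1)
    fb = F.normalize(sample_features(feat_b, uv_b), dim=-1)
    logits = fa @ fb.t() / tau          # (N, N) similarity matrix
    target = torch.arange(fa.shape[0])  # positives lie on the diagonal
    return F.cross_entropy(logits, target)
```

In this framing, "finetuning on a single object for one iteration" amounts to a single optimizer step on `correspondence_nce_loss` computed from two rendered views of that object with known 3D correspondences.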

January 27, 2025