CrossOver: 3D Scene Cross-Modal Alignment
February 20, 2025
Authors: Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni
cs.AI
Abstract
Multi-modal 3D object understanding has gained significant attention, yet
current approaches often assume complete data availability and rigid alignment
across all modalities. We present CrossOver, a novel framework for cross-modal
3D scene understanding via flexible, scene-level modality alignment. Unlike
traditional methods that require aligned modality data for every object
instance, CrossOver learns a unified, modality-agnostic embedding space for
scenes by aligning modalities - RGB images, point clouds, CAD models,
floorplans, and text descriptions - with relaxed constraints and without
explicit object semantics. Leveraging dimensionality-specific encoders, a
multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver
supports robust scene retrieval and object localization, even with missing
modalities. Evaluations on ScanNet and 3RScan datasets show its superior
performance across diverse metrics, highlighting adaptability for real-world
applications in 3D scene understanding.
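The abstract describes aligning modality-specific encoders (RGB, point clouds, CAD, floorplans, text) into a unified scene embedding space with relaxed constraints, so that retrieval works even when some modalities are missing. A minimal sketch of this idea, using hypothetical feature dimensions and a standard symmetric contrastive loss over whichever modality pairs are present in a batch (not the paper's actual architecture or training pipeline):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Projects one modality's features into a shared, normalized embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity is a simple dot product.
        return F.normalize(self.net(x), dim=-1)

def symmetric_contrastive_loss(za, zb, temperature=0.07):
    """InfoNCE-style loss pulling matched scene pairs together in both directions."""
    logits = za @ zb.t() / temperature
    targets = torch.arange(za.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical per-modality input feature dimensions.
encoders = {
    "rgb": ModalityEncoder(512),
    "pointcloud": ModalityEncoder(256),
    "text": ModalityEncoder(384),
}

# A batch of 4 scenes where the text modality is missing: only the
# modalities actually present contribute pairwise alignment terms.
batch = {"rgb": torch.randn(4, 512), "pointcloud": torch.randn(4, 256)}
embeds = {m: encoders[m](x) for m, x in batch.items()}

mods = list(embeds)
loss = sum(
    symmetric_contrastive_loss(embeds[a], embeds[b])
    for i, a in enumerate(mods) for b in mods[i + 1:]
)
```

Because the loss is summed only over available modality pairs, scenes with incomplete data still provide a training signal, which mirrors the "relaxed constraints" and missing-modality robustness the abstract claims.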