3D CoCa: Contrastive Learners are 3D Captioners

April 13, 2025
Authors: Ting Huang, Zeyu Zhang, Yemin Wang, Hao Tang
cs.AI

Abstract

3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and weak cross-modal alignment in existing methods. To address these challenges, we propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation in a single architecture. Our approach leverages a frozen CLIP vision-language backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to capture geometric context, and a multi-modal decoder to generate descriptive captions. Unlike prior two-stage methods that rely on explicit object proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a shared feature space, eliminating the need for external detectors or handcrafted proposals. This joint training paradigm yields stronger spatial reasoning and richer semantic grounding by aligning 3D and textual representations. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that 3D CoCa significantly outperforms current state-of-the-art methods by 10.2% and 5.76% in CIDEr at 0.5 IoU, respectively. Code will be available at https://github.com/AIGeeksGroup/3DCoCa.
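The abstract names the two training objectives but does not spell them out. As a rough illustration only, the PyTorch sketch below shows what jointly optimizing a contrastive loss and a captioning loss over a shared feature space can look like, following the general CoCa recipe (a symmetric InfoNCE term plus an autoregressive cross-entropy term). The function name joint_loss, the alpha/beta weighting, the temperature value, and the tensor shapes are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def joint_loss(scene_emb, text_emb, caption_logits, caption_targets,
               temperature=0.07, alpha=1.0, beta=1.0):
    # scene_emb:       (B, D) pooled 3D scene embeddings
    # text_emb:        (B, D) pooled caption embeddings
    # caption_logits:  (B, T, V) decoder logits over the vocabulary
    # caption_targets: (B, T) ground-truth token ids (-100 marks padding)
    # alpha/beta weighting is an assumed detail, not from the paper.

    # Symmetric InfoNCE contrastive loss aligning 3D and text embeddings:
    # matched scene/caption pairs sit on the diagonal of the similarity matrix.
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = scene_emb @ text_emb.t() / temperature          # (B, B) similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    contrastive = (F.cross_entropy(sim, labels) +
                   F.cross_entropy(sim.t(), labels)) / 2

    # Autoregressive captioning loss (teacher-forced next-token prediction).
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
        ignore_index=-100,
    )
    return alpha * contrastive + beta * captioning

Because both terms backpropagate into the same scene and text encoders, the contrastive term shapes the shared embedding space that the caption decoder reads from, which is the "joint training paradigm" the abstract credits for stronger spatial reasoning and semantic grounding.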
