3D CoCa: Contrastive Learners are 3D Captioners
April 13, 2025
Authors: Ting Huang, Zeyu Zhang, Yemin Wang, Hao Tang
cs.AI
Abstract
3D captioning, which aims to describe the content of 3D scenes in natural
language, remains highly challenging due to the inherent sparsity of point
clouds and weak cross-modal alignment in existing methods. To address these
challenges, we propose 3D CoCa, a novel unified framework that seamlessly
combines contrastive vision-language learning with 3D caption generation in a
single architecture. Our approach leverages a frozen CLIP vision-language
backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to
capture geometric context, and a multi-modal decoder to generate descriptive
captions. Unlike prior two-stage methods that rely on explicit object
proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a
shared feature space, eliminating the need for external detectors or
handcrafted proposals. This joint training paradigm yields stronger spatial
reasoning and richer semantic grounding by aligning 3D and textual
representations. Extensive experiments on the ScanRefer and Nr3D benchmarks
demonstrate that 3D CoCa significantly outperforms current state-of-the-art
methods by 10.2% and 5.76% in CIDEr at 0.5 IoU, respectively. Code will be
available at https://github.com/AIGeeksGroup/3DCoCa.
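The abstract describes jointly optimizing a contrastive alignment objective and a captioning objective in a shared feature space. The following is a minimal PyTorch sketch of what such a joint loss could look like: a symmetric InfoNCE term that aligns pooled 3D scene embeddings with text embeddings, plus a token-level cross-entropy term for caption generation. The tensor shapes, the `lam` weighting, and the function names are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a joint contrastive + captioning objective,
# in the spirit of the training paradigm described in the abstract.
import torch
import torch.nn.functional as F


def contrastive_loss(scene_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized scene/text embeddings (B x D)."""
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temperature   # B x B similarity matrix
    targets = torch.arange(scene_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def captioning_loss(token_logits, token_targets, pad_id=0):
    """Cross-entropy over caption tokens: logits (B x T x V), targets (B x T)."""
    return F.cross_entropy(token_logits.flatten(0, 1),
                           token_targets.flatten(),
                           ignore_index=pad_id)


def joint_loss(scene_emb, text_emb, token_logits, token_targets, lam=1.0):
    """Sum of contrastive alignment and caption generation objectives."""
    return contrastive_loss(scene_emb, text_emb) + lam * captioning_loss(
        token_logits, token_targets)


if __name__ == "__main__":
    B, D, T, V = 4, 512, 16, 1000  # batch, embed dim, caption length, vocab size
    loss = joint_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, T, V), torch.randint(1, V, (B, T)))
    print(loss.item())
```

In this sketch, the scene and text embeddings would come from the 3D scene encoder and the frozen CLIP text branch, respectively, while the token logits would come from the multi-modal decoder; the single scalar loss lets both objectives shape the shared feature space during training.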