3D CoCa: Contrastive Learners are 3D Captioners
April 13, 2025
Authors: Ting Huang, Zeyu Zhang, Yemin Wang, Hao Tang
cs.AI
Abstract
3D captioning, which aims to describe the content of 3D scenes in natural
language, remains highly challenging due to the inherent sparsity of point
clouds and weak cross-modal alignment in existing methods. To address these
challenges, we propose 3D CoCa, a novel unified framework that seamlessly
combines contrastive vision-language learning with 3D caption generation in a
single architecture. Our approach leverages a frozen CLIP vision-language
backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to
capture geometric context, and a multi-modal decoder to generate descriptive
captions. Unlike prior two-stage methods that rely on explicit object
proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a
shared feature space, eliminating the need for external detectors or
handcrafted proposals. This joint training paradigm yields stronger spatial
reasoning and richer semantic grounding by aligning 3D and textual
representations. Extensive experiments on the ScanRefer and Nr3D benchmarks
demonstrate that 3D CoCa significantly outperforms current state-of-the-art
methods by 10.2% and 5.76% in CIDEr at 0.5 IoU, respectively. Code will be
available at https://github.com/AIGeeksGroup/3DCoCa.
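The abstract describes jointly optimizing a contrastive alignment objective and a captioning objective in a shared feature space. The following is a minimal PyTorch sketch of what such a joint loss could look like: a symmetric InfoNCE term that aligns pooled 3D scene embeddings with text embeddings, plus a token-level cross-entropy term for caption generation. The tensor shapes, the `lam` weighting, and the function names are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a joint contrastive + captioning objective,
# in the spirit of the training paradigm described in the abstract.
import torch
import torch.nn.functional as F


def contrastive_loss(scene_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized scene/text embeddings (B x D)."""
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temperature   # B x B similarity matrix
    targets = torch.arange(scene_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def captioning_loss(token_logits, token_targets, pad_id=0):
    """Cross-entropy over caption tokens: logits (B x T x V), targets (B x T)."""
    return F.cross_entropy(token_logits.flatten(0, 1),
                           token_targets.flatten(),
                           ignore_index=pad_id)


def joint_loss(scene_emb, text_emb, token_logits, token_targets, lam=1.0):
    """Sum of contrastive alignment and caption generation objectives."""
    return contrastive_loss(scene_emb, text_emb) + lam * captioning_loss(
        token_logits, token_targets)


if __name__ == "__main__":
    B, D, T, V = 4, 512, 16, 1000  # batch, embed dim, caption length, vocab size
    loss = joint_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, T, V), torch.randint(1, V, (B, T)))
    print(loss.item())
```

In this sketch, the scene and text embeddings would come from the 3D scene encoder and the frozen CLIP text branch, respectively, while the token logits would come from the multi-modal decoder; the single scalar loss lets both objectives shape the shared feature space during training.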