

Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

February 14, 2025
Authors: Wenxuan Guo, Xiuwei Xu, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
cs.AI

Abstract

In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet the requirements of real-time inference due to their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework along this technical route. However, since in the 3D visual grounding task the 3D scene representation must interact deeply with text features, a sparse-convolution-based architecture is inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA), which deeply fuse the 3D scene representation with text features in an efficient way through gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently interacts the voxel features with text features via cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively repairs over-pruned regions by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves the top inference speed, surpassing the previous fastest method by 100% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead in Acc@0.5 on ScanRefer, and +2.6 and +3.2 leads on NR3D and SR3D respectively. The code is available at https://github.com/GWxuan/TSP3D.
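To make the text-guided pruning idea concrete, the sketch below shows one way cross-attention between voxel features and text tokens could produce per-voxel relevance scores, after which only the top fraction of voxels is kept for the next level. This is a minimal illustrative sketch, not the authors' implementation: the class name, feature dimension, score head, and keep_ratio are assumptions, and the real method operates on sparse convolution features (see the TSP3D repository linked above for the actual code).

```python
import torch
import torch.nn as nn


class TextGuidedPruningSketch(nn.Module):
    """Hypothetical sketch of text-guided pruning: fuse voxel features with
    text features via cross-attention, score each voxel's relevance to the
    referring expression, and keep only the top-scoring voxels."""

    def __init__(self, dim: int = 128, num_heads: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        # Voxels attend to text tokens (voxel queries, text keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Small MLP mapping each fused voxel feature to a relevance score.
        self.score_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.keep_ratio = keep_ratio

    def forward(self, voxel_feats: torch.Tensor, text_feats: torch.Tensor):
        # voxel_feats: (N, dim) features of the scene's occupied voxels
        # text_feats:  (T, dim) encoded tokens of the referring expression
        q = voxel_feats.unsqueeze(0)           # (1, N, dim)
        kv = text_feats.unsqueeze(0)           # (1, T, dim)
        fused, _ = self.cross_attn(q, kv, kv)  # text-conditioned voxel features
        fused = fused.squeeze(0)               # (N, dim)

        scores = self.score_head(fused).squeeze(-1)            # (N,) relevance scores
        k = max(1, int(self.keep_ratio * voxel_feats.size(0))) # number of voxels to keep
        keep_idx = scores.topk(k).indices                      # prune the rest
        return fused[keep_idx], keep_idx, scores


# Toy usage: 4096 occupied voxels, 16 text tokens, 128-dim features.
if __name__ == "__main__":
    tgp = TextGuidedPruningSketch()
    kept, idx, scores = tgp(torch.randn(4096, 128), torch.randn(16, 128))
    print(kept.shape)  # torch.Size([2048, 128]) with keep_ratio = 0.5
```

In the same spirit, completion-based addition would monitor the pruned region around the predicted target and re-insert (complete) voxels that were dropped too aggressively; since only a small neighborhood is affected, its extra cost is negligible, as the abstract notes.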
