Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

Abstract

Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically used only as a weight initialization step for task-specific fine-tuning, which limits their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments demonstrate that our method not only achieves performance competitive with supervised models but also surpasses existing self-supervised approaches by a large margin. The model and training code are available in our GitHub repository (https://github.com/phermosilla/msm).
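To make the evaluation idea more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each hierarchy level of a model exposes a set of 3D coordinates with one feature vector per coordinate, samples every level at each input point by brute-force nearest neighbor, concatenates the results into a point-level descriptor, and classifies points with a cosine-similarity nearest-neighbor rule. The function names, the `(coords, feats)` layout, and the nearest-neighbor sampling scheme are all assumptions made for illustration; the paper's actual sampling and probing details may differ.

```python
import numpy as np


def nearest_index(queries, refs):
    """Index of the closest reference point for each query (brute force)."""
    # Pairwise squared distances, shape (num_queries, num_refs).
    d2 = ((queries[:, None, :] - refs[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def multires_point_features(points, levels):
    """Build per-point descriptors by sampling every hierarchy level.

    points: (N, 3) array of input point coordinates.
    levels: list of (coords, feats) pairs, one per level (hypothetical
            layout): coords is (M, 3), feats is (M, C).
    Returns an (N, sum of C over levels) array.
    """
    per_level = [feats[nearest_index(points, coords)] for coords, feats in levels]
    return np.concatenate(per_level, axis=1)


def nn_classify(train_x, train_y, test_x, eps=1e-12):
    """Label each test feature with the label of its most similar
    training feature under cosine similarity."""
    a = train_x / (np.linalg.norm(train_x, axis=1, keepdims=True) + eps)
    b = test_x / (np.linalg.norm(test_x, axis=1, keepdims=True) + eps)
    return train_y[(b @ a.T).argmax(axis=1)]
```

A linear probe would use the same concatenated descriptors but fit a single linear layer on top of them instead of the nearest-neighbor rule. The brute-force distance matrix is only practical for small point sets; a spatial index would be needed for full scenes.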
