Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

Abstract

Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically used only as a weight initialization step for task-specific fine-tuning, which limits their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and are therefore suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments demonstrate not only that our method is competitive with supervised models, but also that it surpasses existing self-supervised approaches by a large margin. The model and training code are available in our GitHub repository (https://github.com/phermosilla/msm).
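
To make the evaluation idea concrete, the following is a minimal sketch of multi-resolution feature sampling followed by a nearest-neighbor evaluation. It is not taken from the authors' repository. The function names, the `(coords, feats)` layout of each hierarchy level, the nearest-point lookup used to sample each level, and the cosine-similarity 1-NN classifier are all assumptions made for illustration; the actual protocol may sample and combine features differently.

```python
import numpy as np


def nearest_index(queries, refs):
    """For each query point, return the index of its closest reference point."""
    # Pairwise squared Euclidean distances, shape (num_queries, num_refs).
    d2 = ((queries[:, None, :] - refs[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def multires_point_features(points, levels):
    """Build one feature vector per input point from all hierarchy levels.

    `levels` is a list of (coords, feats) pairs, one per level of a
    hierarchical encoder (hypothetical layout). Each point takes the feature
    of its nearest neighbor at every level; the results are concatenated.
    """
    sampled = [feats[nearest_index(points, coords)] for coords, feats in levels]
    out = np.concatenate(sampled, axis=1)
    # L2-normalize so that dot products below are cosine similarities.
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.clip(norms, 1e-12, None)


def nn_classify(train_feats, train_labels, test_feats):
    """Assign each test point the label of its most similar training point."""
    sims = test_feats @ train_feats.T
    return train_labels[sims.argmax(axis=1)]


# Toy example: four points, a fine level (one feature per point) and a
# coarse level (two downsampled points).
points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
fine = (points, np.eye(4))
coarse = (np.array([[0, 0.5, 0], [1, 0.5, 0]]), np.eye(2))
feats = multires_point_features(points, [fine, coarse])
labels = np.array([0, 1, 0, 1])
pred = nn_classify(feats, labels, feats)
```

Because the encoder stays frozen and only a lookup or a linear layer sits on top of these features, an evaluation of this kind measures the quality of the off-the-shelf representation rather than the effect of fine-tuning.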
