MINIMA: Modality Invariant Image Matching

December 27, 2024
Authors: Xingyu Jiang, Jiangwei Ren, Zizhuo Li, Xin Zhou, Dingkang Liang, Xiang Bai
cs.AI

Abstract

Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges to the matching task. Existing works try to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. In this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, our MINIMA aims to enhance universal performance from the perspective of data scaling up. For such purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate matching labels. Specifically, we scale up the modalities from cheap but rich RGB-only matching data, by means of generative models. Under this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multimodal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability. Extensive experiments on in-domain and zero-shot matching tasks, including 19 cross-modal cases, demonstrate that our MINIMA can significantly outperform the baselines and even surpass modality-specific methods. The dataset and code are available at https://github.com/LSXI7/MINIMA.
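
The key idea of the data engine described above is that a generative image-to-image model produces a synthetic-modality image that is pixel-aligned with its RGB source, so the RGB pair's ground-truth correspondences carry over to the cross-modal pair for free. The following is a minimal sketch of that label-inheritance idea, assuming such pixel alignment; `translate_to_modality` and `make_cross_modal_pair` are hypothetical names for illustration, not part of the MINIMA codebase, and a trivial luminance map stands in for the real generative model so the sketch runs end to end.

```python
import numpy as np

def translate_to_modality(rgb: np.ndarray, modality: str) -> np.ndarray:
    # Placeholder for a generative image-to-image model (e.g. RGB -> infrared).
    # A simple luminance map stands in here; the real engine would call a
    # trained translation network that preserves pixel alignment.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return (rgb.astype(np.float32) @ weights).astype(np.uint8)

def make_cross_modal_pair(rgb_a, rgb_b, matches, modality="infrared"):
    # Generate the second view in the target modality. Because the generated
    # image is pixel-aligned with its RGB source, the correspondence labels
    # `matches` (rows of (x_a, y_a, x_b, y_b)) remain valid unchanged.
    gen_b = translate_to_modality(rgb_b, modality)
    return rgb_a, gen_b, matches

# Usage: a dummy RGB pair with one known correspondence, turned into a
# cross-modal training sample while inheriting its matching label.
rgb_a = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
rgb_b = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
matches = np.array([[100, 200, 120, 210]])
img_a, img_b, labels = make_cross_modal_pair(rgb_a, rgb_b, matches)
```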
