

Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

March 24, 2025
作者: Jeonghyeon Kim, Sangheum Hwang
cs.AI

Abstract
Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitations of naïve fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
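The core idea of the proposed objective, regularizing the distance between matched image and text embeddings on the hypersphere, can be illustrated with a minimal sketch. This is not the paper's exact loss; the function name and the use of cosine distance between matched ID pairs are assumptions for illustration only.

```python
import numpy as np

def cross_modal_alignment_loss(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Illustrative sketch (not the paper's exact objective): penalize the
    distance between matched ID image/text embeddings on the unit hypersphere.

    image_emb, text_emb: arrays of shape (batch, dim), row i of each being
    a matched image-text pair.
    """
    # Project both modalities onto the unit hypersphere, as CLIP does.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Cosine similarity of each matched (image, text) pair.
    cos_sim = np.sum(img * txt, axis=1)
    # (1 - cos_sim) is half the squared Euclidean distance between the two
    # unit vectors; minimizing its mean pulls matched pairs together,
    # shrinking the modality gap within ID embeddings.
    return float(np.mean(1.0 - cos_sim))
```

In training, a term like this would be added to the standard fine-tuning loss (e.g., a CLIP-style contrastive loss) with a weighting coefficient, so that semantically matched text and image embeddings are drawn closer on the sphere.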

