RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

Abstract

Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify the precise image features that contribute to zero-shot classification errors. RaVL then mitigates the identified spurious correlations with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.
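The worst-group accuracy mentioned above is commonly computed by splitting the evaluation set into groups and reporting the accuracy of the weakest one. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes each group is a (class label, spurious feature present) pair; the function name `worst_group_accuracy` and the toy data are hypothetical.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (worst per-group accuracy, dict of per-group accuracies).

    Assumption: `groups` holds one hashable group id per example, here a
    (class label, spurious feature present) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(preds, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    if not total:
        raise ValueError("no examples provided")
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Hypothetical toy data: 9 examples, binary class, binary spurious feature.
preds = [1, 1, 1, 0, 0, 1, 0, 0, 0]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
groups = [(1, 1), (1, 1), (1, 0), (1, 0), (0, 1), (0, 1), (0, 1), (0, 0), (0, 0)]

worst, per_group = worst_group_accuracy(preds, labels, groups)
overall = sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

In this toy data the overall accuracy looks acceptable while the group where the class appears without the spurious feature lags behind, which is the gap a worst-group metric is meant to expose.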
