
Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

March 14, 2024
作者: Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper
cs.AI

Abstract

Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classification, to image captioning, to visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
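To make the shape-bias percentages in the abstract concrete: shape bias is conventionally measured on texture-shape cue-conflict images (each image carries one shape label and a conflicting texture label), as the fraction of cue-decided answers that follow the shape cue. The following is a minimal sketch of that metric; the function name and the toy data are illustrative assumptions, not artifacts from the paper.

```python
# Sketch of the texture-vs-shape bias metric on cue-conflict images.
# Each trial pairs a shape label with a conflicting texture label;
# predictions matching neither cue are excluded from the denominator.

def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-decided predictions that follow the shape cue."""
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
        # answers matching neither label are ignored
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else 0.0

# Toy example: 4 cue-conflict trials (hypothetical model answers)
preds    = ["cat", "elephant", "cat", "bird"]
shapes   = ["cat", "cat", "cat", "bird"]
textures = ["elephant", "elephant", "dog", "dog"]
print(shape_bias(preds, shapes, textures))  # → 0.75
```

Under this metric, a value of 0.49 means the model resolves conflicts mostly by texture, while the human reference of 0.96 means near-total reliance on shape; the paper's prompting experiments move a VLM between roughly 0.49 and 0.72 on this scale.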
