Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

Abstract

Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications ranging from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision: specifically, to what extent they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
