DASH: Detection and Assessment of Systematic Hallucinations of VLMs

Abstract

Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the "natural image manifold" to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than 19k clusters with 950k images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations. Code and data are available at https://YanNeu.github.io/DASH.
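To make the notion of an object hallucination concrete, the following is a minimal sketch of how a hallucination rate could be measured on images known not to contain a given object. It is not the DASH pipeline: the function `hallucination_rate`, the `ask_vlm` callable, the yes/no prompt wording, and the toy stand-in model are all assumptions introduced here for illustration.

```python
from typing import Callable, Iterable


def hallucination_rate(
    images: Iterable[str],
    object_name: str,
    ask_vlm: Callable[[str, str], str],
) -> float:
    """Fraction of object-free images on which the model answers 'yes'.

    `images` are assumed to NOT contain `object_name`, so every 'yes'
    counts as an object hallucination. `ask_vlm(image, prompt)` is a
    hypothetical wrapper around whatever VLM is being evaluated.
    """
    # Assumed prompt wording; a real evaluation may phrase this differently.
    prompt = f"Is there a {object_name} in the image? Answer yes or no."
    answers = [ask_vlm(image, prompt) for image in images]
    if not answers:
        raise ValueError("need at least one image")
    hallucinated = sum(a.strip().lower().startswith("yes") for a in answers)
    return hallucinated / len(answers)


def toy_vlm(image: str, prompt: str) -> str:
    """Toy stand-in for a real model: says 'Yes' for any harbor scene."""
    return "Yes" if "harbor" in image else "No"


# None of these (hypothetical) images contains a boat.
object_free_images = ["harbor_1.jpg", "harbor_2.jpg", "forest_1.jpg", "desk_1.jpg"]
rate = hallucination_rate(object_free_images, "boat", toy_vlm)
print(f"hallucination rate for 'boat': {rate:.2f}")
```

In this toy setup the errors concentrate on one kind of scene rather than being spread evenly over the images, which is the sort of systematic pattern that a cluster of semantically similar images would capture.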
