DASH: Detection and Assessment of Systematic Hallucinations of VLMs

March 30, 2025
Authors: Maximilian Augustin, Yannic Neuhaus, Matthias Hein
cs.AI

Abstract

Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the "natural image manifold" to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than 19k clusters with 950k images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations. Code and data are available at https://YanNeu.github.io/DASH.
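To make the optimization-based retrieval step concrete, below is a minimal sketch of a DASH-OPT-style objective, not the authors' implementation: gradient ascent on a generator latent increases the VLM's probability of answering "yes" to an object-presence query, while a quadratic prior on the latent loosely stands in for staying on the "natural image manifold". The names `generator`, `vlm_yes_logprob`, and `lambda_reg` are hypothetical placeholders, and the toy stand-ins in the usage block only illustrate the loop structure.

```python
# Hypothetical sketch of a DASH-OPT-style latent optimization (not the
# paper's code): ascend the VLM's log-probability of answering "yes" to
# "Is there a <object> in the image?" while a Gaussian prior on the
# latent approximates staying on the natural image manifold.
import torch

def dash_opt_step(z, generator, vlm_yes_logprob, lr=0.05, lambda_reg=0.1):
    """One gradient-ascent step on the hallucination objective.

    z                -- latent code of an image generator (requires_grad)
    generator        -- placeholder: maps latent z to an image tensor
    vlm_yes_logprob  -- placeholder: log P_VLM("yes" | image, object query)
    lambda_reg       -- weight of the latent prior (manifold proxy)
    """
    image = generator(z)
    objective = vlm_yes_logprob(image) - lambda_reg * z.pow(2).sum()
    grad, = torch.autograd.grad(objective, z)
    return (z + lr * grad).detach().requires_grad_(True)

# Usage with toy stand-ins, just to show the loop structure:
if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(16)                    # stand-in "VLM" weights
    generator = lambda z: torch.tanh(z)    # stand-in image decoder
    vlm_yes_logprob = lambda img: img @ w  # stand-in yes-score
    z = torch.randn(16, requires_grad=True)
    for _ in range(100):
        z = dash_opt_step(z, generator, vlm_yes_logprob)
```

In the paper's pipeline the optimized images serve as queries for retrieving real images, so the result reported above (clusters of real, semantically similar images) comes from retrieval and clustering, not from the synthetic images themselves.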

