On the Limitations of Vision-Language Models in Understanding Image Transforms

March 12, 2025
Authors: Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz
cs.AI

Abstract

Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
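The following is a minimal sketch (not the authors' released code) of the kind of probe the paper describes: apply a basic image-level transform and check whether CLIP prefers a caption that mentions the transform over one that does not. It assumes the standard OpenAI CLIP checkpoint on HuggingFace; the image path "dog.jpg" and the two captions are hypothetical placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Apply a simple image-level transform (a 90-degree rotation).
image = Image.open("dog.jpg").rotate(90, expand=True)

# One caption describes only the scene; the other also describes the transform.
captions = ["a photo of a dog", "a rotated photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_captions)

probs = logits.softmax(dim=-1).squeeze(0)
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
# If the model understood the transform, the transform-aware caption should
# score higher; the paper's finding is that this often fails to happen.
```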

