

YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models

September 20, 2024
Authors: Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, Ankit Raj, Pawan Goyal, Niloy Ganguly
cs.AI

Abstract

Understanding satire and humor is challenging even for current Vision-Language Models. In this paper, we propose three challenging tasks: Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason the image is satirical), and Completion (given one half of an image, selecting the other half from two given options such that the complete image is satirical). We release a high-quality dataset, YesBut, consisting of 2547 images (1084 satirical and 1463 non-satirical) in different artistic styles, to evaluate these tasks. Each satirical image in the dataset depicts a normal scenario along with a conflicting scenario that is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut dataset in zero-shot settings, with respect to both automated and human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at https://github.com/abhi1nandy2/yesbut_dataset.

