One missing piece in Vision and Language: A Survey on Comics Understanding

September 14, 2024
Authors: Emanuele Vivoli, Andrey Barsky, Mohamed Ali Souibgui, Artemis LLabres, Marco Bertini, Dimosthenis Karatzas
cs.AI

Abstract

Vision-language models have recently evolved into versatile systems capable of high performance across a range of tasks, such as document understanding, visual question answering, and grounding, often in zero-shot settings. Comics Understanding, a complex and multifaceted field, stands to benefit greatly from these advances. Comics, as a medium, combine rich visual and textual narratives, challenging AI models with tasks that span image classification, object detection, instance segmentation, and deeper narrative comprehension through sequential panels. However, the unique structure of comics, characterized by creative variations in style, reading order, and non-linear storytelling, presents a set of challenges distinct from those in other vision-language domains. In this survey, we present a comprehensive review of Comics Understanding from both dataset and task perspectives. Our contributions are fivefold: (1) We analyze the structure of the comics medium, detailing its distinctive compositional elements; (2) We survey the widely used datasets and tasks in comics research, emphasizing their role in advancing the field; (3) We introduce the Layer of Comics Understanding (LoCU) framework, a novel taxonomy that redefines vision-language tasks within comics and lays the foundation for future work; (4) We provide a detailed review and categorization of existing methods following the LoCU framework; (5) Finally, we highlight current research challenges and propose directions for future exploration, particularly in the context of vision-language models applied to comics. This survey is the first to propose a task-oriented framework for comics intelligence and aims to guide future research by addressing critical gaps in data availability and task definition. A project associated with this survey is available at https://github.com/emanuelevivoli/awesome-comics-understanding.
