FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
November 23, 2024
Authors: Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Zhifei Zhang, Yilin Wang, Jianming Zhang, Jiebo Luo
cs.AI
Abstract
The advent of large Vision-Language Models (VLMs) has significantly advanced
multimodal tasks, enabling more sophisticated and accurate reasoning across
various applications, including image and video captioning, visual question
answering, and cross-modal retrieval. Despite their superior capabilities, VLMs
struggle to perceive fine-grained compositional information about image regions.
Specifically, they have difficulty accurately aligning the segmentation masks
with the corresponding semantics and precisely describing the compositional
aspects of the referred regions.
However, compositionality - the ability to understand and generate novel
combinations of known visual and textual components - is critical for
enabling coherent reasoning and understanding across modalities in VLMs. To
address this issue, we propose FINECAPTION, a novel VLM that can recognize
arbitrary masks as referential inputs and process high-resolution images for
compositional image captioning at different granularity levels. To support this
endeavor, we introduce COMPOSITIONCAP, a new dataset for multi-grained region
compositional image captioning, which introduces the task of compositional
attribute-aware regional image captioning.
Empirical results demonstrate the effectiveness of our proposed model
compared to other state-of-the-art VLMs. Additionally, we analyze the
capabilities of current VLMs in recognizing various visual prompts for
compositional region image captioning, highlighting areas for improvement in
VLM design and training.