FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

November 23, 2024
Authors: Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Zhifei Zhang, Yilin Wang, Jianming Zhang, Jiebo Luo
cs.AI

Abstract

The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate reasoning across applications such as image and video captioning, visual question answering, and cross-modal retrieval. Despite these capabilities, VLMs still struggle to perceive fine-grained compositional information about image regions. Specifically, they have difficulty accurately aligning segmentation masks with the corresponding semantics and precisely describing the compositional aspects of the referred regions. Yet compositionality, the ability to understand and generate novel combinations of known visual and textual components, is critical for coherent reasoning and understanding across modalities. To address this issue, we propose FINECAPTION, a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different granularity levels. To support this effort, we introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning. Empirical results demonstrate the effectiveness of our proposed model compared to other state-of-the-art VLMs. Additionally, we analyze how well current VLMs recognize various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.
