COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
February 4, 2025
Authors: Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen
cs.AI
Abstract
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic
segmentation and grounded image captioning. Building upon the COCO dataset with
advanced COCONut panoptic masks, this dataset aims to overcome limitations in
existing image-text datasets that often lack detailed, scene-comprehensive
descriptions. The COCONut-PanCap dataset incorporates fine-grained,
region-level captions grounded in panoptic segmentation masks, ensuring
consistency and improving the detail of generated captions. Through
human-edited, densely annotated descriptions, COCONut-PanCap supports improved
training of vision-language models (VLMs) for image understanding and
generative models for text-to-image tasks. Experimental results demonstrate
that COCONut-PanCap significantly boosts performance across understanding and
generation tasks, offering complementary benefits to large-scale datasets. This
dataset sets a new benchmark for evaluating models on joint panoptic
segmentation and grounded captioning tasks, addressing the need for
high-quality, detailed image-text annotations in multi-modal learning.
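The abstract describes annotations that pair panoptic segmentation masks with human-edited, region-level captions plus a dense scene-level description. Below is a minimal sketch of how one such record could be represented in Python; the class and field names are hypothetical illustrations for clarity, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: these names are hypothetical and do not reflect
# COCONut-PanCap's actual annotation format.

@dataclass
class RegionCaption:
    segment_id: int   # id of a panoptic segment in the mask
    category: str     # panoptic category ("thing" or "stuff" class)
    caption: str      # human-edited, region-level description

@dataclass
class PanCapAnnotation:
    image_id: int
    panoptic_mask: str                 # path to the panoptic segmentation mask
    dense_caption: str                 # detailed, scene-level description
    regions: List[RegionCaption] = field(default_factory=list)

# Example record pairing segment-level captions with a dense image caption.
ann = PanCapAnnotation(
    image_id=123,
    panoptic_mask="panoptic/000000000123.png",
    dense_caption="A cyclist in a red jacket rides a black road bike down a "
                  "tree-lined street past parked cars.",
    regions=[
        RegionCaption(1, "person", "a cyclist in a red jacket pedaling forward"),
        RegionCaption(2, "bicycle", "a black road bike with thin tires"),
        RegionCaption(3, "tree", "leafy trees lining the left sidewalk"),
    ],
)
```

In this sketch, each region caption is tied to a segment id from the panoptic mask, which is how the captions stay grounded in, and consistent with, the segmentation.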