ChatPaper.aiChatPaper

URECA:独特区域全能描述

URECA: Unique Region Caption Anything

April 7, 2025
作者: Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim
cs.AI

摘要

区域级描述生成旨在为特定图像区域生成自然语言描述,同时突出其显著特征。然而,现有方法难以在多粒度上生成独特的描述,限制了其在实际应用中的有效性。为满足对区域级细节理解的需求,我们引入了URECA数据集,这是一个专为多粒度区域描述而构建的大规模数据集。与以往主要关注显著对象的数据集不同,URECA通过整合多样化的对象、部件及背景元素,确保了区域与描述之间独特且一致的映射关系。其核心在于分阶段的数据处理流程,每一阶段逐步优化区域选择与描述生成。通过在各阶段利用多模态大语言模型(MLLMs),我们的流程能够生成具有更高准确性和语义多样性的独特且上下文相关的描述。基于此数据集,我们提出了URECA模型,该模型旨在有效编码多粒度区域信息。URECA通过对现有MLLMs进行简单而有效的修改,保留了位置和形状等关键空间属性,从而实现了细粒度且语义丰富的区域描述。我们的方法引入了动态掩码建模和高分辨率掩码编码器,以增强描述的独特性。实验表明,URECA在URECA数据集上达到了最先进的性能,并在现有区域级描述基准测试中展现出良好的泛化能力。
English
Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multi-granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on URECA dataset and generalizes well to existing region-level captioning benchmarks.

Summary

AI-Generated Summary

PDF343April 8, 2025