URECA: Unique Region Caption Anything
April 7, 2025
Authors: Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim
cs.AI
Abstract
Region-level captioning aims to generate natural language descriptions for
specific image regions while highlighting their distinguishing features.
However, existing methods struggle to produce unique captions across
multiple levels of granularity, limiting their real-world applicability. To
address the need for detailed region-level understanding, we introduce the
URECA dataset, a large-scale dataset tailored for multi-granularity region
captioning. Unlike prior datasets that focus primarily on salient objects, the
URECA dataset ensures a
unique and consistent mapping between regions and captions by incorporating a
diverse set of objects, parts, and background elements. Central to this is a
stage-wise data curation pipeline, where each stage incrementally refines
region selection and caption generation. By leveraging Multimodal Large
Language Models (MLLMs) at each stage, our pipeline produces distinctive and
contextually grounded captions with improved accuracy and semantic diversity.
Building upon this dataset, we present URECA, a novel captioning model designed
to effectively encode multi-granularity regions. URECA maintains essential
spatial properties such as position and shape through simple yet impactful
modifications to existing MLLMs, enabling fine-grained and semantically rich
region descriptions. Our approach introduces dynamic mask modeling and a
high-resolution mask encoder to enhance caption uniqueness. Experiments show
that URECA achieves state-of-the-art performance on the URECA dataset and
generalizes well to existing region-level captioning benchmarks.