MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
November 28, 2024
Authors: Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim
cs.AI
Abstract
Referring Image Segmentation (RIS) is an advanced vision-language task that
involves identifying and segmenting objects within an image according to
free-form text descriptions. While previous studies have focused on aligning
visual and language features, training techniques such as data augmentation
remain underexplored. In this work, we explore effective data
augmentation for RIS and propose a novel training framework called Masked
Referring Image Segmentation (MaskRIS). We observe that conventional image
augmentations fall short for RIS, leading to performance degradation, while
simple random masking significantly improves RIS performance. MaskRIS
uses both image and text masking, followed by Distortion-aware Contextual
Learning (DCL) to fully exploit the benefits of the masking strategy. This
approach can improve the model's robustness to occlusions, incomplete
information, and various linguistic complexities, resulting in a significant
performance improvement. Experiments demonstrate that MaskRIS can easily be
applied to various RIS models, outperforming existing methods in both fully
supervised and weakly supervised settings. Finally, MaskRIS achieves new
state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code
is available at https://github.com/naver-ai/maskris.
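
To make the input-side idea concrete, the following is a minimal sketch of random image and text masking as an augmentation. It is not the authors' implementation: the function names, patch size, mask ratios, zero fill value, and the `[MASK]` placeholder are all assumptions chosen for illustration, and the Distortion-aware Contextual Learning (DCL) step is not shown because the abstract does not specify it.

```python
import numpy as np


def mask_image_patches(image, patch_size=32, mask_ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    rows, cols = h // patch_size, w // patch_size
    num_patches = rows * cols
    num_masked = int(round(mask_ratio * num_patches))
    # Pick distinct patch indices to mask.
    masked_ids = rng.choice(num_patches, size=num_masked, replace=False)
    out = image.copy()  # leave the caller's array untouched
    for idx in masked_ids:
        r, c = divmod(int(idx), cols)
        out[r * patch_size:(r + 1) * patch_size,
            c * patch_size:(c + 1) * patch_size] = 0
    return out


def mask_text_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]", rng=None):
    """Replace each token with a placeholder with probability mask_ratio."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(len(tokens)) >= mask_ratio
    return [tok if k else mask_token for tok, k in zip(tokens, keep)]
```

In a training loop, a sketch like this would be applied to each image and referring expression before they reach the segmentation model, while the ground-truth segmentation mask is left unchanged so the model must infer the occluded or missing context.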