MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

Abstract

Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting the objects in an image that a free-form text description refers to. While previous studies have focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS and lead to performance degradation, while simple random masking significantly improves RIS performance. MaskRIS applies both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach improves the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models and outperforms existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
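
To make the idea of random image and text masking concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). The function names, the patch size, the masking ratios, the zero fill value, and the `[MASK]` placeholder token are all assumptions chosen for illustration. The abstract does not specify these details, and the DCL step is not modeled here.

```python
import random


def mask_image_patches(image, patch_size=2, mask_ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    `image` is a 2D list (H x W) of pixel values; H and W are assumed to be
    multiples of `patch_size`. Returns a new image and leaves the input intact.
    All parameter values are illustrative assumptions.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Top-left corner of every patch in the grid.
    patches = [(r, c)
               for r in range(0, height, patch_size)
               for c in range(0, width, patch_size)]
    num_masked = int(len(patches) * mask_ratio)
    masked = [row[:] for row in image]
    for r, c in rng.sample(patches, num_masked):
        for i in range(r, r + patch_size):
            for j in range(c, c + patch_size):
                masked[i][j] = 0
    return masked


def mask_text_tokens(text, mask_ratio=0.3, mask_token="[MASK]", rng=None):
    """Replace a random subset of whitespace-separated words with `mask_token`.

    The placeholder token and the ratio are hypothetical choices.
    """
    rng = rng or random.Random()
    words = text.split()
    num_masked = int(len(words) * mask_ratio)
    for idx in rng.sample(range(len(words)), num_masked):
        words[idx] = mask_token
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[1] * 4 for _ in range(4)]
    print(mask_image_patches(image, rng=rng))
    print(mask_text_tokens("the man in the red shirt on the left", rng=rng))
```

In a training loop, both functions would be applied to each image-text pair before the forward pass, so the model sees partially occluded images and incomplete referring expressions.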
