MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

Abstract

Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting the objects in an image that a free-form text description refers to. While previous studies have focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS, leading to performance degradation, while simple random masking significantly improves RIS performance. MaskRIS applies both image and text masking, followed by Distortion-aware Contextual Learning (DCL), to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and varied linguistic complexity, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can be easily applied to various RIS models and outperforms existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
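
To make the input-masking idea concrete, the following is a minimal sketch of random image-patch masking and random word masking for a single (image, referring expression) pair. It is not the released MaskRIS implementation: the function names, the patch size, the mask ratios, and the `[MASK]` placeholder are all assumptions chosen for illustration, and the sketch does not cover DCL, whose details the abstract does not describe.

```python
import numpy as np


def mask_image_patches(image, patch_size=32, mask_ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    Hypothetical helper: patch_size and mask_ratio are illustrative
    defaults, not values taken from the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    height, width = image.shape[:2]
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Sample patch indices without replacement so exactly num_masked are hidden.
    masked_ids = rng.choice(num_patches, size=num_masked, replace=False)
    out = image.copy()  # leave the caller's array untouched
    for idx in masked_ids:
        row, col = divmod(int(idx), grid_w)
        out[row * patch_size:(row + 1) * patch_size,
            col * patch_size:(col + 1) * patch_size] = 0
    return out


def mask_text_tokens(text, mask_ratio=0.15, mask_token="[MASK]", rng=None):
    """Replace each whitespace-separated word with mask_token with
    probability mask_ratio (illustrative word-level masking)."""
    if rng is None:
        rng = np.random.default_rng()
    words = text.split()
    # One uniform draw in [0, 1) per word decides whether it is masked.
    draws = rng.random(len(words))
    return " ".join(mask_token if d < mask_ratio else w
                    for w, d in zip(words, draws))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.ones((64, 64, 3), dtype=np.uint8)
    masked_image = mask_image_patches(image, patch_size=16, mask_ratio=0.5, rng=rng)
    masked_text = mask_text_tokens("the man on the left holding a cup", rng=rng)
    print(masked_image.shape, masked_text)
```

In a training loop, such functions would be applied to the model inputs of each sample; how the masked and unmasked views are then combined during learning is the role of DCL and is left out here.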
