Towards Natural Image Matting in the Wild via Real-Scenario Prior

Abstract

Recent approaches attempt to adapt powerful interactive segmentation models, such as SAM, to interactive matting and fine-tune the models on synthetic matting datasets. However, models trained on synthetic data fail to generalize to complex scenes with occlusion. We address this challenge by proposing a new matting dataset based on the COCO dataset, namely COCO-Matting. Specifically, the construction of COCO-Matting includes accessory fusion and mask-to-matte, which select real-world complex images from COCO and convert semantic segmentation masks to matting labels. The resulting COCO-Matting comprises an extensive collection of 38,251 human instance-level alpha mattes in complex natural scenarios. Furthermore, existing SAM-based matting methods extract intermediate features and masks from a frozen SAM and train only a lightweight matting decoder with end-to-end matting losses, an approach that does not fully exploit the potential of the pre-trained SAM. Thus, we propose SEMat, which revamps the network architecture and training objectives. For the network architecture, the proposed feature-aligned transformer learns to extract fine-grained edge and transparency features, and the proposed matte-aligned decoder aims to segment matting-specific objects and convert coarse masks into high-precision mattes. For the training objectives, the proposed regularization and trimap losses aim to retain the prior from the pre-trained model and push the matting logits extracted from the mask decoder to contain trimap-based semantic information. Extensive experiments across seven diverse datasets demonstrate the superior performance of our method and its efficacy in interactive natural image matting. We open-source our code, models, and dataset at https://github.com/XiaRho/SEMat.
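The trimap loss mentioned above builds on the notion of a trimap, which labels each pixel as foreground, background, or unknown. As background only, the following minimal sketch shows one conventional way to derive a trimap from an alpha matte: threshold the alpha values, then widen the unknown band around class boundaries. It does not reproduce SEMat's loss or the COCO-Matting pipeline; the helper name `alpha_to_trimap`, the thresholds `lo` and `hi`, the `radius` parameter, and the label values 0/128/255 are all illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]

# Illustrative label values (an assumption, not taken from the paper).
FG, BG, UNKNOWN = 255, 0, 128


def alpha_to_trimap(alpha: Matrix, lo: float = 0.05, hi: float = 0.95,
                    radius: int = 1) -> List[List[int]]:
    """Label each pixel as foreground, background, or unknown.

    Hypothetical helper: thresholds and radius are illustrative choices.
    """
    h, w = len(alpha), len(alpha[0])
    # Step 1: threshold the alpha matte into three classes.
    trimap = [[FG if a >= hi else BG if a <= lo else UNKNOWN for a in row]
              for row in alpha]
    # Step 2: widen the unknown band so that hard mask edges also
    # receive an uncertain region around them.
    out = [row[:] for row in trimap]
    for y in range(h):
        for x in range(w):
            neighbours = {
                trimap[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            }
            # A window that mixes classes sits on a boundary.
            if len(neighbours) > 1:
                out[y][x] = UNKNOWN
    return out
```

With `radius=0` the function reduces to plain thresholding; larger radii give a wider unknown band, which matters when the input is a hard segmentation mask with no soft edge values.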
