ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Abstract

Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.
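As a rough illustration of the third component, the sketch below shows one way auxiliary texts (synonyms or a definition) could be folded into a class embedding before patch-level classification by cosine similarity. It is a toy example under stated assumptions, not the ITACLIP implementation: the abstract does not say how the auxiliary texts are combined, so the weighted average, the `aux_weight` parameter, the function names, and the hand-written 3-dimensional vectors standing in for CLIP text and patch embeddings are all hypothetical.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def class_embedding(name_vec, aux_vecs, aux_weight=0.5):
    """Blend a class-name embedding with the mean of its auxiliary embeddings.

    The blending rule and aux_weight are assumptions made for illustration.
    """
    name_vec = normalize(name_vec)
    if not aux_vecs:
        return name_vec
    aux = [normalize(v) for v in aux_vecs]
    aux_mean = [sum(col) / len(aux) for col in zip(*aux)]
    blended = [(1 - aux_weight) * n + aux_weight * a
               for n, a in zip(name_vec, aux_mean)]
    return normalize(blended)


def segment(patch_vecs, class_vecs):
    """Label each patch with the class whose embedding is most similar."""
    labels = []
    for p in patch_vecs:
        scores = {name: cosine(p, c) for name, c in class_vecs.items()}
        labels.append(max(scores, key=scores.get))
    return labels


# Toy stand-ins for text embeddings (class name + one synonym each)
classes = {
    "cat": class_embedding([1.0, 0.0, 0.0], [[0.9, 0.1, 0.0]]),
    "sky": class_embedding([0.0, 0.0, 1.0], [[0.0, 0.2, 0.8]]),
}
# Toy stand-ins for per-patch image features
patches = [[0.8, 0.2, 0.1], [0.1, 0.1, 0.9]]
print(segment(patches, classes))  # ['cat', 'sky']
```

In a real pipeline the vectors would come from CLIP's text and image encoders, and the per-patch labels would be reshaped into a segmentation map. The released code at the repository linked above is the authoritative reference for how ITACLIP actually does this.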

