ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

November 18, 2024
Authors: M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin
cs.AI

Abstract

Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.
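
The abstract lists three training-free components: fusing middle-layer attention maps with the last ViT layer, averaging predictions over augmented views of the input image (Image Engineering), and expanding each class name with LLM-generated definitions and synonyms. The sketch below is a minimal, hypothetical illustration of how such pieces could fit together; the function names, tensor shapes, and simple averaging strategies are assumptions for clarity, not the authors' implementation (which is available at the repository linked above).

```python
# Hypothetical sketch of the three ITACLIP ideas, using plain PyTorch tensors.
# All shapes and helper names are illustrative assumptions.
import torch
import torch.nn.functional as F

def fuse_attention_maps(middle_attns, last_attn):
    """Combine attention maps from selected middle ViT layers with the last layer.

    middle_attns: list of tensors, each of shape (heads, tokens, tokens)
    last_attn:    tensor of shape (heads, tokens, tokens)
    Here the fusion is a simple mean over layers (an assumed choice).
    """
    stacked = torch.stack(middle_attns + [last_attn], dim=0)  # (layers, heads, T, T)
    return stacked.mean(dim=0)                                # (heads, T, T)

def class_text_embedding(encode_text, prompts):
    """Encode a class name together with its LLM-generated definition and synonyms,
    then average the normalized embeddings into a single class vector.

    encode_text: callable mapping a list of strings to (N, D) text embeddings
    prompts:     e.g. ["a photo of a cat", "a small domesticated feline", ...]
    """
    emb = F.normalize(encode_text(prompts), dim=-1)  # (N, D)
    return F.normalize(emb.mean(dim=0), dim=-1)      # (D,)

def average_over_views(view_logits):
    """Image Engineering (assumed form): average the per-class logits predicted on
    augmented views after they have been mapped back to the original image grid."""
    return torch.stack(view_logits, dim=0).mean(dim=0)

def segmentation_logits(patch_feats, class_embs, logit_scale=100.0):
    """Dense prediction: cosine similarity between patch features and class embeddings.

    patch_feats: (H*W, D) visual features from the modified ViT
    class_embs:  (C, D) one averaged embedding per class
    returns:     (C, H*W) logits; argmax over C yields the segmentation map
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    return logit_scale * class_embs @ patch_feats.T
```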
