ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
November 18, 2024
Authors: M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin
cs.AI
Abstract
Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.
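
Point 1) of the abstract describes a purely architectural, training-free change: self-attention maps taken from intermediate ViT layers are combined with the last layer's attention before dense patch features are formed. The sketch below illustrates that fusion step with plain tensors; the equal fusion weights, the head-averaged attention maps, and the toy shapes are assumptions for illustration, not the official ITACLIP implementation (see the repository linked above for the actual code).

```python
# Minimal sketch of the attention-fusion idea (assumption-heavy, not the official
# ITACLIP code): average attention maps collected from intermediate ViT layers,
# blend them with the final layer's map, and use the result to aggregate value
# vectors into one feature per image patch.
import torch

@torch.no_grad()
def fuse_attention(last_attn: torch.Tensor,
                   middle_attns: list[torch.Tensor],
                   alpha: float = 0.5) -> torch.Tensor:
    """Blend the last layer's attention with the mean of middle-layer maps.
    All maps are head-averaged and have shape (batch, tokens, tokens)."""
    middle = torch.stack(middle_attns).mean(dim=0)
    return alpha * last_attn + (1.0 - alpha) * middle  # equal weighting is a placeholder

@torch.no_grad()
def dense_patch_features(values: torch.Tensor, fused_attn: torch.Tensor) -> torch.Tensor:
    """Aggregate value vectors with the fused attention map, yielding one
    feature per token that can later be scored against class text embeddings."""
    return fused_attn @ values  # (batch, tokens, dim)

# Toy usage with random tensors standing in for real CLIP ViT activations.
B, N, D = 1, 197, 768  # batch, tokens (196 patches + CLS), feature dim for ViT-B/16
last = torch.softmax(torch.randn(B, N, N), dim=-1)
middles = [torch.softmax(torch.randn(B, N, N), dim=-1) for _ in range(3)]
feats = dense_patch_features(torch.randn(B, N, D), fuse_attention(last, middles))
print(feats.shape)  # torch.Size([1, 197, 768])
```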
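
Points 2) and 3) can likewise be pictured as simple averaging steps around a frozen CLIP: predictions are averaged over several augmented views of the input image, and each class is represented by the mean embedding of its name, LLM-generated synonyms, and an LLM-written definition. The sketch below is a hedged illustration of that idea; the `encode_text`/`encode_dense` stand-ins, the specific augmentations, and the plain averaging are assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of Image Engineering and LLM-based text augmentation (assumptions
# throughout, not the paper's exact pipeline): average dense predictions over
# augmented views of the image, and represent each class by the mean embedding of
# its name, synonyms, and an LLM-written definition.
import torch

@torch.no_grad()
def class_text_embedding(encode_text, prompts: list[str]) -> torch.Tensor:
    """Encode several phrasings of one class (name, synonyms, definition)
    and average them into a single, normalized text embedding."""
    emb = torch.stack([encode_text(p) for p in prompts]).mean(dim=0)
    return emb / emb.norm()

@torch.no_grad()
def augmented_logits(encode_dense, image: torch.Tensor, augs, text_embs: torch.Tensor):
    """Score patch features of each augmented view against the class embeddings
    and average the logit maps. Only layout-preserving augmentations are shown;
    geometric ones (e.g., flips) would need their predictions mapped back first."""
    maps = []
    for aug in augs:
        feats = encode_dense(aug(image))                     # (num_patches, dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        maps.append(feats @ text_embs.T)                     # (num_patches, num_classes)
    return torch.stack(maps).mean(dim=0)

# Toy usage: random stand-ins for CLIP's text and dense image encoders.
D, N = 512, 196
encode_text = lambda prompt: torch.randn(D)
encode_dense = lambda img: torch.randn(N, D)
augs = [lambda x: x,                                         # original image
        lambda x: x.mean(dim=0, keepdim=True).expand_as(x)]  # grayscale-like view
cat_emb = class_text_embedding(encode_text, ["cat", "kitty", "a small domesticated feline"])
dog_emb = class_text_embedding(encode_text, ["dog", "puppy", "a domesticated canine kept as a pet"])
logits = augmented_logits(encode_dense, torch.randn(3, 224, 224), augs, torch.stack([cat_emb, dog_emb]))
print(logits.shape)  # torch.Size([196, 2])
```

In this toy version the per-patch logits would simply be argmaxed over classes to obtain a segmentation map; the actual method and its hyperparameters are available in the linked repository.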