ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
November 18, 2024
Authors: M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin
cs.AI
Abstract
Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.
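
Point 1) of the abstract describes a purely architectural, training-free change: self-attention maps taken from intermediate ViT layers are combined with the last layer's attention before dense patch features are formed. The sketch below illustrates that fusion step with plain tensors; the equal fusion weights, the head-averaged attention maps, and the toy shapes are assumptions for illustration, not the official ITACLIP implementation (see the repository linked above for the actual code).

```python
# Minimal sketch of the attention-fusion idea (assumption-heavy, not the official
# ITACLIP code): average attention maps collected from intermediate ViT layers,
# blend them with the final layer's map, and use the result to aggregate value
# vectors into one feature per image patch.
import torch

@torch.no_grad()
def fuse_attention(last_attn: torch.Tensor,
                   middle_attns: list[torch.Tensor],
                   alpha: float = 0.5) -> torch.Tensor:
    """Blend the last layer's attention with the mean of middle-layer maps.
    All maps are head-averaged and have shape (batch, tokens, tokens)."""
    middle = torch.stack(middle_attns).mean(dim=0)
    return alpha * last_attn + (1.0 - alpha) * middle  # equal weighting is a placeholder

@torch.no_grad()
def dense_patch_features(values: torch.Tensor, fused_attn: torch.Tensor) -> torch.Tensor:
    """Aggregate value vectors with the fused attention map, yielding one
    feature per token that can later be scored against class text embeddings."""
    return fused_attn @ values  # (batch, tokens, dim)

# Toy usage with random tensors standing in for real CLIP ViT activations.
B, N, D = 1, 197, 768  # batch, tokens (196 patches + CLS), feature dim for ViT-B/16
last = torch.softmax(torch.randn(B, N, N), dim=-1)
middles = [torch.softmax(torch.randn(B, N, N), dim=-1) for _ in range(3)]
feats = dense_patch_features(torch.randn(B, N, D), fuse_attention(last, middles))
print(feats.shape)  # torch.Size([1, 197, 768])
```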
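
Points 2) and 3) can likewise be pictured as simple averaging steps around a frozen CLIP: predictions are averaged over several augmented views of the input image, and each class is represented by the mean embedding of its name, LLM-generated synonyms, and an LLM-written definition. The sketch below is a hedged illustration of that idea; the `encode_text`/`encode_dense` stand-ins, the specific augmentations, and the plain averaging are assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of Image Engineering and LLM-based text augmentation (assumptions
# throughout, not the paper's exact pipeline): average dense predictions over
# augmented views of the image, and represent each class by the mean embedding of
# its name, synonyms, and an LLM-written definition.
import torch

@torch.no_grad()
def class_text_embedding(encode_text, prompts: list[str]) -> torch.Tensor:
    """Encode several phrasings of one class (name, synonyms, definition)
    and average them into a single, normalized text embedding."""
    emb = torch.stack([encode_text(p) for p in prompts]).mean(dim=0)
    return emb / emb.norm()

@torch.no_grad()
def augmented_logits(encode_dense, image: torch.Tensor, augs, text_embs: torch.Tensor):
    """Score patch features of each augmented view against the class embeddings
    and average the logit maps. Only layout-preserving augmentations are shown;
    geometric ones (e.g., flips) would need their predictions mapped back first."""
    maps = []
    for aug in augs:
        feats = encode_dense(aug(image))                     # (num_patches, dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        maps.append(feats @ text_embs.T)                     # (num_patches, num_classes)
    return torch.stack(maps).mean(dim=0)

# Toy usage: random stand-ins for CLIP's text and dense image encoders.
D, N = 512, 196
encode_text = lambda prompt: torch.randn(D)
encode_dense = lambda img: torch.randn(N, D)
augs = [lambda x: x,                                         # original image
        lambda x: x.mean(dim=0, keepdim=True).expand_as(x)]  # grayscale-like view
cat_emb = class_text_embedding(encode_text, ["cat", "kitty", "a small domesticated feline"])
dog_emb = class_text_embedding(encode_text, ["dog", "puppy", "a domesticated canine kept as a pet"])
logits = augmented_logits(encode_dense, torch.randn(3, 224, 224), augs, torch.stack([cat_emb, dog_emb]))
print(logits.shape)  # torch.Size([196, 2])
```

In this toy version the per-patch logits would simply be argmaxed over classes to obtain a segmentation map; the actual method and its hyperparameters are available in the linked repository.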