ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
February 6, 2025
Authors: Alec Helbling, Tuna Han Salih Meral, Ben Hoover, Pinar Yanardag, Duen Horng Chau
cs.AI
Abstract
Do the rich representations of multi-modal diffusion transformers (DiTs)
exhibit unique properties that enhance their interpretability? We introduce
ConceptAttention, a novel method that leverages the expressive power of DiT
attention layers to generate high-quality saliency maps that precisely locate
textual concepts within images. Without requiring additional training,
ConceptAttention repurposes the parameters of DiT attention layers to produce
highly contextualized concept embeddings, contributing the major discovery that
performing linear projections in the output space of DiT attention layers
yields significantly sharper saliency maps compared to commonly used
cross-attention mechanisms. Remarkably, ConceptAttention even achieves
state-of-the-art performance on zero-shot image segmentation benchmarks,
outperforming 11 other zero-shot interpretability methods on the
ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our
work contributes the first evidence that the representations of multi-modal DiT
models like Flux are highly transferable to vision tasks like segmentation,
even outperforming multi-modal foundation models like CLIP.
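To make the central idea concrete, here is a minimal sketch of turning attention-layer output vectors into per-concept saliency maps with a linear projection (a dot product). It is an illustration under stated assumptions, not the authors' implementation: the names `concept_saliency`, `patch_outputs`, and `concept_outputs` are hypothetical, the vectors are toy values rather than real DiT activations, and the softmax over concepts is an assumed normalization that the abstract does not specify.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def concept_saliency(patch_outputs, concept_outputs):
    """Score every image patch against every concept.

    patch_outputs:   P vectors, assumed to be attention-output vectors
                     for image patches (hypothetical input).
    concept_outputs: C vectors, assumed to be attention-output vectors
                     for concept tokens (hypothetical input).
    Returns C lists of P scores, i.e. one flattened saliency map per concept.
    """
    # Project each patch onto each concept, then normalize across concepts
    # (the normalization is an assumption made for this sketch).
    per_patch = [
        softmax([dot(p, c) for c in concept_outputs]) for p in patch_outputs
    ]
    # Transpose from patch-major to concept-major layout.
    return [[row[k] for row in per_patch] for k in range(len(concept_outputs))]


# Toy example: three patches, two concepts, two-dimensional vectors.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
concepts = [[2.0, 0.0], [0.0, 2.0]]
maps = concept_saliency(patches, concepts)
```

In this toy setup, `maps[0]` is the flattened saliency map for the first concept. Reshaping each list to the image's patch grid would give a 2D map.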