ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

Abstract

Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings. Its central finding is that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps than commonly used cross-attention mechanisms. ConceptAttention also achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.
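The idea of projecting in the attention output space can be illustrated with a toy sketch. This is not the authors' implementation: the function name concept_saliency, the toy vectors, and the softmax normalization across concepts are all assumptions made for illustration, and no real DiT or Flux model is involved. It only shows how dot products between per-patch output vectors and per-concept output vectors could give one saliency value per concept and patch.

```python
import math

def concept_saliency(image_out, concept_out):
    """Score every concept at every image patch with a dot product,
    then normalize across concepts with a softmax (an assumed choice).

    image_out:   list of patch vectors (stand-ins for attention outputs)
    concept_out: list of concept vectors (stand-ins for concept embeddings)
    Returns maps[c][p]: weight of concept c at patch p.
    """
    maps = [[0.0] * len(image_out) for _ in concept_out]
    for p, patch in enumerate(image_out):
        # Linear projection: dot product of the patch with each concept vector.
        scores = [sum(a * b for a, b in zip(patch, concept))
                  for concept in concept_out]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for c, e in enumerate(exps):
            maps[c][p] = e / total
    return maps

# Hypothetical toy data: 2 concepts, 3 patches, vector dimension 2.
concepts = [[1.0, 0.0], [0.0, 1.0]]
patches = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
maps = concept_saliency(patches, concepts)
```

In this toy setup each patch receives weights that sum to one across concepts, so a patch aligned with a concept vector is assigned mostly to that concept. Reshaping each row of maps to the patch grid would give one saliency map per concept.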
