Negative Token Merging: Image-based Adversarial Feature Guidance

Abstract

Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach for pushing output features away from undesired concepts. While useful, adversarial guidance using text alone can be insufficient to capture complex visual concepts and to avoid undesired visual elements such as copyrighted characters. In this paper, we explore for the first time an alternative modality in this direction, performing adversarial guidance directly with visual features from a reference image or from other images in a batch. In particular, we introduce negative token merging (NegToMe), a simple but effective training-free approach that performs adversarial guidance by selectively pushing apart matching semantic features (between the reference and the output generation) during the reverse diffusion process. When used with respect to other images in the same batch, we observe that NegToMe significantly increases output diversity (racial, gender, visual) without sacrificing output image quality. Similarly, when used with respect to a reference copyrighted asset, NegToMe reduces visual similarity to copyrighted content by 34.57%. NegToMe is simple to implement in just a few lines of code, increases inference time only marginally (<4%), and generalizes to different diffusion architectures such as Flux, which do not natively support the use of a separate negative prompt. Code is available at https://negtome.github.io
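To make the idea of "selectively pushing apart matching semantic features" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes token features are given as 2-D arrays, that matching is done by cosine similarity, and that a matched output token is moved by linear extrapolation away from its reference match. The function name `negative_token_merge` and the parameters `alpha` and `threshold` are hypothetical, chosen for illustration only.

```python
import numpy as np

def negative_token_merge(out_tokens, ref_tokens, alpha=0.9, threshold=0.65):
    """Push each output token away from its best-matching reference token.

    out_tokens: (N, D) features of the image being generated.
    ref_tokens: (M, D) features of the reference (or another batch image).
    alpha, threshold: illustrative knobs (assumed, not taken from the paper).
    """
    eps = 1e-8  # avoids division by zero for all-zero tokens
    out_unit = out_tokens / (np.linalg.norm(out_tokens, axis=1, keepdims=True) + eps)
    ref_unit = ref_tokens / (np.linalg.norm(ref_tokens, axis=1, keepdims=True) + eps)

    sim = out_unit @ ref_unit.T                     # (N, M) cosine similarities
    best = sim.argmax(axis=1)                       # closest reference token per output token
    best_sim = sim[np.arange(len(out_tokens)), best]

    matched = ref_tokens[best]                      # (N, D) matched reference features
    pushed = out_tokens + alpha * (out_tokens - matched)  # extrapolate away from the match

    # Only tokens with a sufficiently strong match are modified.
    mask = (best_sim > threshold)[:, None]
    return np.where(mask, pushed, out_tokens)

if __name__ == "__main__":
    out = np.array([[1.0, 0.0], [0.0, 1.0]])
    ref = np.array([[0.9, 0.1]])
    print(negative_token_merge(out, ref, alpha=1.0))
```

In this sketch the first output token closely matches the reference and is pushed away from it, while the second token has no strong match and is left unchanged. In a diffusion model, such a step would presumably be applied to intermediate token features at selected denoising steps; where exactly it is applied is not specified in the abstract.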
