Concept Steerers: Leveraging K-Sparse Autoencoders for Controllable Generations

January 31, 2025
作者: Dahye Kim, Deepti Ghadiyaram
cs.AI

Abstract

Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and can inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style). Through extensive experiments, we demonstrate that our approach is very simple, requires neither retraining of the base model nor LoRA adapters, does not compromise generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of 20.01% in unsafe concept removal, is effective in style manipulation, and is ~5x faster than the current state-of-the-art.
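The core mechanism described in the abstract — encoding a text embedding with a k-sparse autoencoder, adjusting the activation of one monosemantic concept latent, and decoding back — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions, the randomly initialized weights, and the function names (`ksae_encode`, `steer`, etc.) are all hypothetical stand-ins for a trained k-SAE and an identified concept direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): text-embedding dim d,
# SAE latent dim m, and sparsity level k.
d, m, k = 64, 256, 8

# Random weights stand in for a trained k-SAE.
W_enc = rng.normal(scale=0.1, size=(m, d))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(d, m))
b_dec = np.zeros(d)

def ksae_encode(x):
    """Encode an embedding, keeping only the top-k latent activations."""
    z = W_enc @ (x - b_dec) + b_enc
    z_sparse = z.copy()
    # Zero out everything except the k largest activations (the k-sparsity constraint).
    z_sparse[np.argsort(z)[:-k]] = 0.0
    return z_sparse

def ksae_decode(z):
    """Reconstruct the embedding from the sparse latent code."""
    return W_dec @ z + b_dec

def steer(x, concept_idx, scale):
    """Shift the embedding along one monosemantic latent direction.

    scale < 0 steers generation away from the concept;
    scale > 0 steers it toward the concept.
    """
    z = ksae_encode(x)
    z[concept_idx] += scale
    return ksae_decode(z)

x = rng.normal(size=d)                       # stand-in for a prompt's text embedding
x_steered = steer(x, concept_idx=3, scale=-5.0)
```

In the actual method, the steered embedding would then be fed to the diffusion model's denoiser in place of the original one; no model weights are updated, which is why the approach needs no retraining or LoRA adapters.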


February 5, 2025