Controlling Language and Diffusion Models by Transporting Activations

October 30, 2024
Authors: Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau
cs.AI

Abstract

The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.
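To make the core idea concrete, below is a minimal sketch of optimal-transport-guided activation steering, assuming independent per-dimension (1D) transport maps estimated by matching sorted activation samples. This is an illustrative toy, not the authors' implementation: the names fit_transport and steer, the affine parametrization, and the interpolation strength lam are all assumptions made for the example.

```python
# Minimal sketch of OT-guided activation steering (illustrative only, not the
# paper's code). For every activation dimension independently, we fit an
# affine map a*x + b that transports the empirical "source" distribution
# toward the "target" distribution, then apply it with strength lam.
import numpy as np

def fit_transport(src, dst):
    """Fit per-dimension affine maps from src to dst activations.

    src, dst: float arrays of shape (n_samples, n_dims) with equal n_samples,
    e.g. activations collected on undesired vs. desired prompts. Sorting each
    column pairs up quantiles, which realizes the 1D optimal transport
    coupling; a least-squares affine fit on those sorted pairs then gives a
    linear approximation of the transport map.
    """
    s = np.sort(src, axis=0)
    t = np.sort(dst, axis=0)
    s_mean, t_mean = s.mean(axis=0), t.mean(axis=0)
    s_c = s - s_mean
    a = (s_c * (t - t_mean)).sum(axis=0) / np.maximum((s_c**2).sum(axis=0), 1e-8)
    b = t_mean - a * s_mean
    return a, b

def steer(x, a, b, lam=1.0):
    """Interpolate between identity (lam=0) and the full transport map (lam=1)."""
    return (1.0 - lam) * x + lam * (a * x + b)

# Toy usage: transport samples from N(0, 1) toward N(2, 0.25) per dimension.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(1024, 8))  # stand-in "source" activations
dst = rng.normal(2.0, 0.5, size=(1024, 8))  # stand-in "target" activations
a, b = fit_transport(src, dst)
x = rng.normal(0.0, 1.0, size=(4, 8))
print(steer(x, a, b, lam=0.5))  # activations pushed halfway toward the target
```

Matching the i-th smallest source sample to the i-th smallest target sample is the monotone optimal transport coupling in 1D, which is why a simple sort suffices in this sketch; the strength lam trades off how far activations are pushed toward the target distribution against preserving the model's original behavior.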
