Controlling Language and Diffusion Models by Transporting Activations
October 30, 2024
Authors: Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau
cs.AI
Abstract
The increasing capabilities of large generative models and their ever more
widespread deployment have raised concerns about their reliability, safety, and
potential misuse. To address these issues, recent works have proposed to
control model generation by steering model activations in order to effectively
induce or prevent the emergence of concepts or behaviors in the generated
output. In this paper we introduce Activation Transport (AcT), a general
framework to steer activations guided by optimal transport theory that
generalizes many previous activation-steering works. AcT is modality-agnostic
and provides fine-grained control over the model behavior with negligible
computational overhead, while minimally impacting model abilities. We
experimentally show the effectiveness and versatility of our approach by
addressing key challenges in large language models (LLMs) and text-to-image
diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate
toxicity, induce arbitrary concepts, and increase their truthfulness. For T2Is,
we show how AcT enables fine-grained style control and concept negation.
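To make the core idea concrete, here is a minimal sketch of optimal-transport-based activation steering, assuming per-unit 1D transport maps estimated from two sets of cached activations (e.g., prompts without and with a target concept). The function names (`fit_ot_map`, `transport`), the linear least-squares parameterization, and the equal-sample-count assumption are illustrative choices for this sketch, not the paper's exact implementation.

```python
import numpy as np

def fit_ot_map(source_acts, target_acts):
    """Estimate a per-unit 1D optimal transport map between two
    empirical activation distributions. For 1D distributions the
    OT map is the increasing rearrangement, i.e. matching sorted
    samples; here we fit a linear map a*x + b to that matching by
    least squares, yielding one (a, b) pair per hidden unit.

    source_acts, target_acts: arrays of shape (n_samples, n_units);
    this sketch assumes both sets have the same number of samples
    (subsample or interpolate quantiles otherwise).
    """
    src = np.sort(source_acts, axis=0)  # sort each unit's samples
    tgt = np.sort(target_acts, axis=0)
    src_mean, tgt_mean = src.mean(0), tgt.mean(0)
    src_c = src - src_mean
    # Per-unit least-squares slope and intercept for tgt ~ a*src + b.
    a = (src_c * (tgt - tgt_mean)).sum(0) / (src_c ** 2).sum(0)
    b = tgt_mean - a * src_mean
    return a, b

def transport(acts, a, b, strength=1.0):
    """Interpolate between the original activations and their
    transported counterparts; strength in [0, 1] controls how
    strongly the target concept is induced."""
    return (1.0 - strength) * acts + strength * (a * acts + b)
```

In practice, maps like these would be fit per layer offline and applied at inference time (e.g., via forward hooks), with the strength parameter trading off how strongly a concept is induced against preserving the model's original abilities.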