LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
April 14, 2025
Authors: Minqian Liu, Zhiyang Xu, Xinyi Zhang, Heajun An, Sarvech Qadir, Qi Zhang, Pamela J. Wisniewski, Jin-Hee Cho, Sang Won Lee, Ruoxi Jia, Lifu Huang
cs.AI
Abstract
Recent advancements in Large Language Models (LLMs) have enabled them to
approach human-level persuasion capabilities. However, such potential also
raises concerns about the safety risks of LLM-driven persuasion, particularly
their potential for unethical influence through manipulation, deception,
exploitation of vulnerabilities, and many other harmful tactics. In this work,
we present a systematic investigation of LLM persuasion safety through two
critical aspects: (1) whether LLMs appropriately reject unethical persuasion
tasks and avoid unethical strategies during execution, including cases where
the initial persuasion goal appears ethically neutral, and (2) how influencing
factors like personality traits and external pressures affect their behavior.
To this end, we introduce PersuSafety, the first comprehensive framework for
the assessment of persuasion safety which consists of three stages, i.e.,
persuasion scene creation, persuasive conversation simulation, and persuasion
safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and
15 common unethical strategies. Through extensive experiments across 8 widely
used LLMs, we observe significant safety concerns in most LLMs, including
failing to identify harmful persuasion tasks and leveraging various unethical
persuasion strategies. Our study calls for more attention to improving safety
alignment in progressive and goal-driven conversations such as persuasion.