LLM 具有政治正確性嗎？分析 AI 系統中的道德偏見與越獄漏洞

摘要

儘管大型語言模型（LLMs）展現出在各種任務中令人印象深刻的熟練度，但它們存在潛在的安全風險，例如“越獄”，惡意輸入可能迫使LLMs生成有害內容。為了應對這些問題，許多LLM開發者已實施各種安全措施來調整這些模型。這種調整涉及多種技術，包括在預訓練期間進行數據過濾、監督微調、從人類反饋中進行強化學習以及紅隊演習。這些方法通常引入了類似政治正確性（PC）的故意偏見，以確保LLMs的道德行為。本文深入探討了為安全目的注入LLMs的故意偏見，並檢視繞過這些安全調整技術的方法。值得注意的是，這些故意偏見導致在GPT-4o模型中越獄成功率在非二元和同性別關鍵字之間相差20％，在白人和黑人關鍵字之間相差16％，即使提示的其他部分完全相同。我們引入了PCJailbreak的概念，突顯了這些安全誘發偏見所帶來的固有風險。此外，我們提出了一種有效的防禦方法PCDefense，通過在生成之前注入防禦提示，防止越獄企圖。PCDefense作為一種吸引人的替代方案，與需要在生成文本後進行額外推理成本的防護模型（如Llama-Guard）形成對比。我們的研究強調了LLM開發者在設計和實施安全措施時需要採取更負責任的方法的迫切性。

English

Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content. To address these issues, many LLM developers have implemented various safety measures to align these models. This alignment involves several techniques, including data filtering during pre-training, supervised fine-tuning, reinforcement learning from human feedback, and red-teaming exercises. These methods often introduce deliberate and intentional biases similar to Political Correctness (PC) to ensure the ethical behavior of LLMs. In this paper, we delve into the intentional biases injected into LLMs for safety purposes and examine methods to circumvent these safety alignment techniques. Notably, these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of PCJailbreak, highlighting the inherent risks posed by these safety-induced biases. Additionally, we propose an efficient defense method PCDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. PCDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize the urgent need for LLM developers to adopt a more responsible approach when designing and implementing safety measures.

LLM 具有政治正確性嗎？分析 AI 系統中的道德偏見與越獄漏洞

Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

摘要

Summary

Support

Support