A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

November 20, 2024
Authors: Gabriel Chua, Shing Yee Chan, Shaun Khoo
cs.AI

Abstract

Large Language Models are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant with respect to the system prompt, our guardrails effectively generalize to other misuse categories, including jailbreak and harmful prompts. Lastly, we further contribute to the field by open-sourcing both the synthetic dataset and the off-topic guardrail models, providing valuable resources for developing guardrails in pre-production environments and supporting future research and development in LLM safety.
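
As a concrete illustration of the relevance-classification framing described above, the sketch below asks a general-purpose LLM to judge whether a user prompt is on-topic with respect to an application's system prompt. This is a minimal sketch, not the authors' released guardrail model: the openai SDK call is a real API, but the model name and the judge prompt are illustrative assumptions.

# Minimal sketch of an LLM-based off-topic guardrail: classify whether a
# user prompt is relevant to the application's system prompt.
# Assumes the `openai` Python SDK and an OPENAI_API_KEY in the environment;
# the model name and judge prompt are placeholders, not the paper's artifacts.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are a guardrail. Decide whether the user prompt is
on-topic for the application described by the system prompt. Answer with
exactly one of: "on-topic" or "off-topic".

Application system prompt:
{system_prompt}

User prompt:
{user_prompt}"""

def is_off_topic(system_prompt: str, user_prompt: str) -> bool:
    """Return True if the user prompt falls outside the application's scope."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable instruction-tuned LLM
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                system_prompt=system_prompt, user_prompt=user_prompt
            ),
        }],
        temperature=0,  # deterministic judgments for a classification task
    )
    return "off-topic" in response.choices[0].message.content.lower()

if __name__ == "__main__":
    scope = "You are a customer-support assistant for a telecom provider."
    print(is_off_topic(scope, "Write me a poem about the ocean."))  # likely True
    print(is_off_topic(scope, "How do I reset my router?"))         # likely False

Because the classifier conditions on the system prompt rather than on a fixed list of disallowed topics, the same guardrail can be reused across applications, which is the property the abstract credits for generalization to jailbreak and harmful prompts.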
