A Flexible Large Language Model Guardrail Development Methodology Applied to Off-Topic Prompt Detection
November 20, 2024
Authors: Gabriel Chua, Shing Yee Chan, Shaun Khoo
cs.AI
Abstract
Large Language Models are prone to off-topic misuse, where users may prompt
these models to perform tasks beyond their intended scope. Current guardrails,
which often rely on curated examples or custom classifiers, suffer from high
false-positive rates, limited adaptability, and the impracticality of requiring
real-world data that is not available in pre-production. In this paper, we
introduce a flexible, data-free guardrail development methodology that
addresses these challenges. By thoroughly defining the problem space
qualitatively and passing this to an LLM to generate diverse prompts, we
construct a synthetic dataset to benchmark and train off-topic guardrails that
outperform heuristic approaches. Additionally, by framing the task as
classifying whether the user prompt is relevant with respect to the system
prompt, our guardrails effectively generalize to other misuse categories,
including jailbreak and harmful prompts. Lastly, we further contribute to the
field by open-sourcing both the synthetic dataset and the off-topic guardrail
models, providing valuable resources for developing guardrails in
pre-production environments and supporting future research and development in
LLM safety.
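The abstract frames off-topic detection as classifying whether a user prompt is relevant with respect to the system prompt. Below is a minimal sketch of that framing as a zero-shot LLM judge; it is not the authors' released fine-tuned guardrail model, and the `llm` callable, the judge template, and the function names are illustrative placeholders for whatever chat-completion backend you have available.

```python
# Sketch of the relevance-classification framing: a guardrail that asks an
# LLM whether a user prompt is on-topic for a given system prompt.
from typing import Callable

# Hypothetical judge template; the paper's actual prompt wording may differ.
JUDGE_TEMPLATE = """You are a guardrail that decides whether a user prompt \
is on-topic for a given system prompt.

System prompt:
{system_prompt}

User prompt:
{user_prompt}

Answer with exactly one word: "on-topic" or "off-topic"."""


def is_off_topic(system_prompt: str, user_prompt: str,
                 llm: Callable[[str], str]) -> bool:
    """Return True if the LLM judge deems the user prompt off-topic.

    `llm` is a placeholder for any function that sends a prompt to a
    chat model and returns its text response.
    """
    verdict = llm(JUDGE_TEMPLATE.format(system_prompt=system_prompt,
                                        user_prompt=user_prompt))
    return verdict.strip().lower().startswith("off")


if __name__ == "__main__":
    # Stubbed model call for demonstration; swap in a real LLM client.
    def fake_llm(prompt: str) -> str:
        return "off-topic"

    print(is_off_topic("You are a customer-support bot for an airline.",
                       "Write me a poem about quantum mechanics.",
                       fake_llm))  # -> True
```

Because the judge conditions on the system prompt rather than on a fixed list of disallowed topics, the same framing extends naturally to the other misuse categories the abstract mentions (jailbreaks, harmful prompts): any request irrelevant to the deployed system prompt is flagged.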