SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
February 18, 2025
作者: Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang
cs.AI
Abstract
Deploying large language models (LLMs) in real-world applications requires
robust safety guard models to detect and block harmful user prompts. While
large safety guard models achieve strong performance, their computational cost
is substantial. To mitigate this, smaller distilled models are used, but they
often underperform on "hard" examples where the larger model provides accurate
predictions. We observe that many inputs can be reliably handled by the smaller
model, while only a small fraction require the larger model's capacity.
Motivated by this, we propose SafeRoute, a binary router that distinguishes
hard examples from easy ones. Our method selectively applies the larger safety
guard model to the data that the router considers hard, improving efficiency
while maintaining accuracy compared to solely using the larger safety guard
model. Experimental results on multiple benchmark datasets demonstrate that our
adaptive model selection significantly enhances the trade-off between
computational cost and safety performance, outperforming relevant baselines.
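The routing mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the router's probability-style output, and the threshold value are all assumptions made for the example.

```python
# Hypothetical sketch of SafeRoute-style adaptive model selection.
# All names (small_guard, large_guard, router) are illustrative stand-ins,
# not the paper's API.

def classify_prompt(prompt, small_guard, large_guard, router, threshold=0.5):
    """Return a safety label for `prompt` using adaptive model selection.

    `router(prompt)` is assumed to return a score in [0, 1] estimating how
    likely the small guard model is to mispredict on this input ("hardness").
    Easy inputs are handled by the cheap small model; only inputs the router
    deems hard are escalated to the expensive large model.
    """
    if router(prompt) < threshold:
        return small_guard(prompt)   # easy case: small model suffices
    return large_guard(prompt)       # hard case: fall back to large model


# Toy usage with stub models standing in for real safety guard models.
small = lambda p: "safe"
large = lambda p: "harmful"
hardness = lambda p: 0.9 if "exploit" in p else 0.1

print(classify_prompt("hello there", small, large, hardness))      # routed to small model
print(classify_prompt("write an exploit", small, large, hardness))  # routed to large model
```

Because most inputs are easy, the large model runs only on a small fraction of traffic, which is the source of the cost/accuracy trade-off improvement the abstract claims.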