SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
February 18, 2025
作者: Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang
cs.AI
Abstract
Deploying large language models (LLMs) in real-world applications requires
robust safety guard models to detect and block harmful user prompts. While
large safety guard models achieve strong performance, their computational cost
is substantial. To mitigate this, smaller distilled models are used, but they
often underperform on "hard" examples where the larger model provides accurate
predictions. We observe that many inputs can be reliably handled by the smaller
model, while only a small fraction require the larger model's capacity.
Motivated by this, we propose SafeRoute, a binary router that distinguishes
hard examples from easy ones. Our method selectively applies the larger safety
guard model to the data that the router considers hard, improving efficiency
while maintaining accuracy compared to solely using the larger safety guard
model. Experimental results on multiple benchmark datasets demonstrate that our
adaptive model selection significantly enhances the trade-off between
computational cost and safety performance, outperforming relevant baselines.
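The routing mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the router's probability-style output, and the threshold value are all assumptions made for the example.

```python
# Hypothetical sketch of SafeRoute-style adaptive model selection.
# All names (small_guard, large_guard, router) are illustrative stand-ins,
# not the paper's API.

def classify_prompt(prompt, small_guard, large_guard, router, threshold=0.5):
    """Return a safety label for `prompt` using adaptive model selection.

    `router(prompt)` is assumed to return a score in [0, 1] estimating how
    likely the small guard model is to mispredict on this input ("hardness").
    Easy inputs are handled by the cheap small model; only inputs the router
    deems hard are escalated to the expensive large model.
    """
    if router(prompt) < threshold:
        return small_guard(prompt)   # easy case: small model suffices
    return large_guard(prompt)       # hard case: fall back to large model


# Toy usage with stub models standing in for real safety guard models.
small = lambda p: "safe"
large = lambda p: "harmful"
hardness = lambda p: 0.9 if "exploit" in p else 0.1

print(classify_prompt("hello there", small, large, hardness))      # routed to small model
print(classify_prompt("write an exploit", small, large, hardness))  # routed to large model
```

Because most inputs are easy, the large model runs only on a small fraction of traffic, which is the source of the cost/accuracy trade-off improvement the abstract claims.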