Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System
February 23, 2025
Authors: Saikat Barua, Mostafizur Rahman, Md Jafor Sadek, Rafiul Islam, Shehnaz Khaled, Ahmedul Kabir
cs.AI
Abstract
Autonomous AI agents built on large language models can create undeniable
value across society, but they face security threats from adversaries that
warrant immediate protective solutions, because trust and safety issues arise.
Advanced attacks such as many-shot jailbreaking and deceptive alignment cannot
be mitigated by the static guardrails used during supervised training, which
makes real-world robustness a crucial research priority. Combining static
guardrails in a dynamic multi-agent system also fails to defend against those
attacks. We intend to enhance security for LLM-based agents through the
development of new evaluation frameworks that identify and counter threats for
safe operational deployment. Our work uses three examination methods: detecting
rogue agents through a Reverse Turing Test, analyzing deceptive alignment
through multi-agent simulations, and developing an anti-jailbreaking system,
tested with the GEMINI 1.5 pro, llama-3.3-70B, and deepseek r1 models using
tool-mediated adversarial scenarios. Detection capabilities are strong, for
example 94% accuracy for GEMINI 1.5 pro, yet the system suffers persistent
vulnerabilities under long attacks: attack success rates (ASR) increase as
prompt length increases, diversity metrics become ineffective in prediction,
and multiple complex system faults are revealed. The findings demonstrate the
necessity of adopting flexible security systems based on active monitoring
performed by the agents themselves, together with adaptable interventions by
system administrators, because current models can create vulnerabilities that
lead to unreliable and vulnerable systems. In our work, we therefore try to
address such situations and propose a comprehensive framework to counteract
these security issues.