Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System

Abstract

Autonomous AI agents built on large language models (LLMs) can create substantial value across society, but they face security threats from adversaries, and the resulting trust and safety issues call for immediate protective measures. Advanced attacks such as many-shot jailbreaking and deceptive alignment cannot be mitigated by the static guardrails applied during supervised training, which makes real-world robustness a crucial research priority. Static guardrails also fail to defend against these attacks when deployed in dynamic multi-agent systems. We aim to enhance the security of LLM-based agents by developing new evaluation frameworks that identify and counter threats, enabling safe operational deployment. Our work uses three examination methods: detecting rogue agents through a Reverse Turing Test, analyzing deceptive alignment through multi-agent simulations, and developing an anti-jailbreaking system that we test on GEMINI 1.5 Pro, llama-3.3-70B, and deepseek r1 using tool-mediated adversarial scenarios. Detection capabilities are strong (for example, 94% accuracy for GEMINI 1.5 Pro), yet the system shows persistent vulnerabilities under long attacks: attack success rate (ASR) rises as prompt length increases, diversity metrics become ineffective as predictors, and multiple complex system faults are revealed. These findings demonstrate the need for flexible security systems based on active monitoring performed by the agents themselves, combined with adaptable interventions by system administrators, because current models can introduce vulnerabilities that leave the system unreliable and open to attack. We address these situations and propose a comprehensive framework to counteract the security issues.
