Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
January 3, 2025
作者: Yanjiang Liu, Shuhen Zhou, Yaojie Lu, Huijia Zhu, Weiqiang Wang, Hongyu Lin, Ben He, Xianpei Han, Le Sun
cs.AI
Abstract
Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently. To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection speed and 16.63% higher success rates compared to existing methods.
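
To make the two mechanisms named in the abstract more concrete, below is a minimal, self-contained Python sketch written under toy assumptions: the strategy pool, the `attack_success_prob` stand-in, and the downgrade schedule are hypothetical placeholders for illustration only, not the paper's actual implementation.

```python
# Toy sketch of early-terminated exploration and progressive reward tracking.
# All names, probabilities, and schedules here are hypothetical stand-ins.
import random

random.seed(0)

# Hypothetical pool of candidate attack strategies the red-team policy explores.
STRATEGIES = ["role_play", "code_injection", "payload_split", "obfuscation"]


def attack_success_prob(strategy: str, downgrade_level: float) -> float:
    """Stand-in for rewriting a malicious query with `strategy`, sending it to
    the target LLM, and judging whether the attack succeeded.

    `downgrade_level` in [0, 1] mimics an intermediate downgrade model:
    higher values weaken the target's safety alignment, which densifies the
    reward signal early in training.
    """
    base = {"role_play": 0.15, "code_injection": 0.05,
            "payload_split": 0.30, "obfuscation": 0.10}[strategy]
    return min(1.0, base + 0.5 * downgrade_level)


def explore(strategy: str, n_queries: int = 20,
            patience: int = 5, downgrade_level: float = 0.5) -> float:
    """Early-terminated exploration: abandon a strategy after `patience`
    consecutive failed queries instead of exhausting the query budget."""
    rewards, stale = [], 0
    for _ in range(n_queries):
        success = random.random() < attack_success_prob(strategy, downgrade_level)
        reward = 1.0 if success else 0.0
        rewards.append(reward)
        stale = 0 if reward > 0 else stale + 1
        if stale >= patience:  # early termination: low-potential strategy
            break
    return sum(rewards) / len(rewards)


def progressive_reward_tracking(rounds: int = 4) -> dict:
    """Anneal the downgrade level toward 0 so rewards earned against the
    weakened model progressively track the fully aligned target."""
    scores = {}
    for r in range(rounds):
        downgrade = 1.0 - (r + 1) / rounds  # 0.75 -> 0.50 -> 0.25 -> 0.00
        scores = {s: explore(s, downgrade_level=downgrade) for s in STRATEGIES}
        best = max(scores, key=scores.get)
        print(f"round {r}: downgrade={downgrade:.2f} "
              f"best strategy={best} score={scores[best]:.2f}")
    return scores


if __name__ == "__main__":
    progressive_reward_tracking()
```

In this sketch, early termination prunes strategies that show no reward within a small patience window, while the shrinking downgrade level plays the role of intermediate downgrade models that keep the reward signal informative as the effective target becomes harder to attack.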