Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models

Abstract

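Automated red teaming has become an important technique for discovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, which limits their ability to adapt to dynamic defenses and to discover complex vulnerabilities efficiently. To address this, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies in order to uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms that reduce exploration complexity and improve strategy optimization. The first, Early-terminated Exploration, accelerates exploration by focusing on high-potential attack strategies. The second, a Progressive Reward Tracking algorithm that uses intermediate downgrade models, dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across a variety of LLMs show that, by substantially improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities than existing methods, with faster detection and a 16.63% higher success rate.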

English

Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently. To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection and 16.63% higher success rates compared to existing methods.

Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
