DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
February 7, 2025
Authors: Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li
cs.AI
Abstract
The rapid advancement of large language models (LLMs) has increased the need
for guardrail models to ensure responsible use, particularly in detecting
unsafe and illegal content. While substantial safety data exist in English,
multilingual guardrail modeling remains underexplored due to the scarcity of
open-source safety data in other languages. To address this gap, we propose a
novel two-player Reinforcement Learning (RL) framework, where a generator and a
guardrail model co-evolve adversarially to produce high-quality synthetic data
for multilingual guardrail training. We theoretically formalize this
interaction as a two-player game, proving convergence to a Nash equilibrium.
Empirical evaluations show that our model, DuoGuard, outperforms state-of-the-art
models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English
benchmarks while being 4.5x faster at inference with a significantly smaller
model (0.5B). We achieve substantial advancements in multilingual safety tasks,
particularly in addressing the imbalance for lower-resource languages in a
collected real dataset. Ablation studies emphasize the critical role of
synthetic data generation in bridging the imbalance in open-source data between
English and other languages. These findings establish a scalable and efficient
approach to synthetic data generation, paving the way for improved multilingual
guardrail models to enhance LLM safety. Code, model, and data will be
open-sourced at https://github.com/yihedeng9/DuoGuard.