DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

Abstract

The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model DuoGuard outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.
