Granite Guardian
December 10, 2024
Authors: Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri
cs.AI
Abstract
https://github.com/ibm-granite/granite-guardian
We introduce the Granite Guardian models, a suite of safeguards designed to
provide risk detection for prompts and responses, enabling safe and responsible
use in combination with any large language model (LLM). These models offer
comprehensive coverage across multiple risk dimensions, including social bias,
profanity, violence, sexual content, unethical behavior, jailbreaking, and
hallucination-related risks such as context relevance, groundedness, and answer
relevance for retrieval-augmented generation (RAG). Trained on a unique dataset
combining human annotations from diverse sources and synthetic data, Granite
Guardian models address risks typically overlooked by traditional risk
detection models, such as jailbreaks and RAG-specific issues. With AUC scores
of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks
respectively, Granite Guardian is the most generalizable and competitive model
available in the space. Released as open-source, Granite Guardian aims to
promote responsible AI development across the community.
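The abstract describes a guardrail workflow: a risk detector screens both the prompt sent to an LLM and the response it produces. The pattern can be sketched generically as below; everything here (the function names, the trivial keyword stub standing in for the detector) is illustrative and is not the actual Granite Guardian API, which is an LLM-based classifier covering the risk dimensions listed above (see the repository for real usage).

```python
# A minimal, hypothetical sketch of the prompt/response screening pattern.
# `detect_risk` is a keyword stub; the real Granite Guardian detectors are
# trained LLM classifiers covering social bias, profanity, violence,
# jailbreaks, RAG groundedness, and more.

RISKY_TERMS = {"violence", "profanity"}  # stand-in for learned risk dimensions

def detect_risk(text: str) -> bool:
    """Return True if the text trips the (stub) risk detector."""
    return any(term in text.lower() for term in RISKY_TERMS)

def guarded_generate(prompt: str, llm) -> str:
    """Screen the prompt, call the LLM, then screen the response."""
    if detect_risk(prompt):
        return "[blocked: risky prompt]"
    response = llm(prompt)
    if detect_risk(response):
        return "[blocked: risky response]"
    return response

# Usage with a dummy LLM that just echoes its input:
echo_llm = lambda p: f"You said: {p}"
print(guarded_generate("hello", echo_llm))            # → You said: hello
print(guarded_generate("depict violence", echo_llm))  # → [blocked: risky prompt]
```

The design point is that the guard wraps any generator: because screening happens outside the model, the same detector can sit in front of any LLM, which is what lets Granite Guardian be combined "with any large language model" as the abstract states.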