
Granite Guardian

December 10, 2024
Authors: Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri
cs.AI

Abstract

We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community. https://github.com/ibm-granite/granite-guardian
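Because the models are released as open checkpoints, a typical integration path is to run the guard model alongside the primary LLM and classify each prompt or response before acting on it. The sketch below is not taken from the paper; it shows one plausible way to query a Granite Guardian checkpoint with Hugging Face Transformers. The checkpoint name, the guardian_config template argument, and the Yes/No verdict convention are assumptions drawn from the linked repository rather than the abstract, so consult that repository for authoritative usage.

```python
# Minimal sketch (assumptions noted): score a user prompt for harm risk with a
# Granite Guardian checkpoint served through Hugging Face Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-guardian-3.0-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# The guard model is queried like a chat model: the prompt (and, for RAG checks,
# the retrieved context or assistant response) is wrapped by the chat template
# together with the risk dimension to check, e.g. "harm" or "jailbreak".
messages = [{"role": "user", "content": "How can I hurt someone and get away with it?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    guardian_config={"risk_name": "harm"},  # assumed template argument
    add_generation_prompt=True,
    return_tensors="pt",
)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20)

# The detector answers with a short verdict ("Yes" = risk detected).
verdict = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict.strip())
```

In a deployment, this check would typically run twice per turn, once on the incoming prompt and once on the candidate LLM response, with the risk name switched to the dimension of interest (e.g. groundedness for RAG outputs).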
