MSTS: A Multimodal Safety Test Suite for Vision-Language Models
January 17, 2025
Authors: Paul Röttger, Giuseppe Attanasio, Felix Friedrich, Janis Goldzycher, Alicia Parrish, Rishabh Bhardwaj, Chiara Di Bonaventura, Roman Eng, Gaia El Khoury Geagea, Sujata Goswami, Jieun Han, Dirk Hovy, Seogyeong Jeong, Paloma Jeretič, Flor Miriam Plaza-del-Arco, Donya Rooein, Patrick Schramowski, Anastassia Shaitarova, Xudong Shen, Richard Willats, Andrea Zugarini, Bertie Vidgen
cs.AI
Abstract
Vision-language models (VLMs), which process image and text inputs, are
increasingly integrated into chat assistants and other consumer AI
applications. Without proper safeguards, however, VLMs may give harmful advice
(e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs).
Despite these clear hazards, little work so far has evaluated VLM safety and
the novel risks created by multimodal inputs. To address this gap, we introduce
MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts
across 40 fine-grained hazard categories. Each test prompt consists of a text
and an image that only in combination reveal their full unsafe meaning. With
MSTS, we find clear safety issues in several open VLMs. We also find some VLMs
to be safe by accident, meaning that they are safe because they fail to
understand even simple test prompts. We translate MSTS into ten languages,
showing non-English prompts to increase the rate of unsafe model responses. We
also show models to be safer when tested with text only rather than multimodal
prompts. Finally, we explore the automation of VLM safety assessments, finding
even the best safety classifiers to be lacking.