Using MLLMs as judges for image safety assessment, without human labeling.

MLLM-as-a-Judge for Image Safety without Human Labeling

December 31, 2024
作者: Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain
cs.AI

Abstract

Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which, however, brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research shows that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose an MLLM-based method that includes objectifying safety rules, assessing the relevance between rules and images, making quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and conducting more in-depth reasoning with cascaded chain-of-thought processes if necessary. Experimental results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.
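To make the pipeline described above concrete, the following is a minimal Python sketch of how the four stages (objectified rules, relevance filtering, debiased quick judgment, and escalation to deeper reasoning) could be wired together. Everything here is illustrative: the `yes_prob` interface, the `Rule` type, the prompts, the thresholds, and the content-free debiasing query are assumptions made for exposition, not the authors' implementation or API.

```python
# Hypothetical sketch of the zero-shot image safety-judgment pipeline outlined in
# the abstract. The MLLM call is abstracted as a callable returning the probability
# that the model's answer token is "yes"; all names are illustrative.
from dataclasses import dataclass
from typing import Callable

# (image bytes, prompt) -> probability that the model answers "yes".
MLLMYesProb = Callable[[bytes, str], float]

@dataclass
class Rule:
    raw: str          # original, possibly subjective rule from the constitution
    objectified: str  # rule rewritten into an objective, checkable criterion

def is_relevant(image: bytes, rule: Rule, yes_prob: MLLMYesProb, thr: float = 0.5) -> bool:
    """Relevance filtering: skip rules that do not apply to the image at all."""
    prompt = f"Is this image potentially related to: {rule.objectified}? Answer yes or no."
    return yes_prob(image, prompt) > thr

def debiased_score(image: bytes, rule: Rule, yes_prob: MLLMYesProb) -> float:
    """Quick judgment from token probabilities, corrected for the model's prior
    tendency to answer 'yes' regardless of the image (estimated here with a
    content-free placeholder query -- one possible debiasing choice, assumed)."""
    p_img = yes_prob(image, f"Does this image violate the rule: {rule.objectified}? Answer yes or no.")
    p_prior = yes_prob(b"", f"Does an unspecified image violate the rule: {rule.objectified}? Answer yes or no.")
    return p_img - p_prior

def judge(image: bytes, constitution: list[Rule], yes_prob: MLLMYesProb,
          lo: float = 0.1, hi: float = 0.4) -> str:
    """Walk the constitution rule by rule; escalate only ambiguous cases."""
    for rule in constitution:
        if not is_relevant(image, rule, yes_prob):
            continue
        score = debiased_score(image, rule, yes_prob)
        if score > hi:
            return "unsafe"               # confident quick judgment
        if score > lo:
            # Ambiguous: hand off to a cascaded chain-of-thought query over the
            # rule's simplified precondition chain (not shown in this sketch).
            return "needs_deep_reasoning"
    return "safe"
```

The key design point the sketch tries to capture is that the expensive chain-of-thought reasoning is only triggered for images whose debiased scores fall in an uncertain band, while clear-cut cases are resolved directly from token probabilities.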
