
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation

January 29, 2025
Authors: Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura
cs.AI

Abstract

Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including those that can harm individuals' privacy, perpetuate biases, and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. The safety of LLMs is a key property that needs to be thoroughly tested before a model is deployed and made accessible to general users. This paper reports on the external safety testing experience conducted by researchers from Mondragon University and the University of Seville on OpenAI's new o3-mini LLM as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.
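
The testing workflow the abstract outlines (generate category-specific unsafe prompts, execute them against the model, flag suspicious responses, and queue them for manual verification) can be illustrated with a minimal sketch. The prompt generator, safety categories, and the `classify_response` judge below are hypothetical stand-ins, not ASTRAL's actual interfaces; the call assumes the OpenAI Python SDK, the `o3-mini` model identifier, and an API key in the environment.

```python
# Minimal sketch of an automated LLM safety-testing loop in the spirit of the
# workflow described in the abstract. Generator and judge are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SAFETY_CATEGORIES = ["hate_speech", "self_harm", "illegal_activity"]  # illustrative only


def generate_unsafe_prompts(category: str, n: int) -> list[str]:
    """Hypothetical generator of up-to-date unsafe test inputs for one category."""
    return [f"[{category}] synthetic unsafe test input #{i}" for i in range(n)]


def classify_response(prompt: str, response: str) -> str:
    """Hypothetical safety judge: returns 'safe', 'unsafe', or 'unknown' (manual review)."""
    return "unknown"


flagged = []
for category in SAFETY_CATEGORIES:
    for prompt in generate_unsafe_prompts(category, n=5):
        resp = client.chat.completions.create(
            model="o3-mini",  # early beta identifier assumed; adjust as needed
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content
        if classify_response(prompt, answer) != "safe":
            flagged.append((category, prompt, answer))  # queue for manual verification

print(f"{len(flagged)} responses flagged for manual review")
```

In the paper's setting, the flagged cases correspond to the responses ASTRAL classified as unsafe, which were then manually verified to confirm actual unsafe behavior.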
