Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
January 29, 2025
Authors: Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura
cs.AI
Abstract
Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including those that can harm individuals' privacy, perpetuate biases, and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. The safety of LLMs is a key property that needs to be thoroughly tested before a model is deployed and made accessible to general users. This paper reports on the external safety testing experience conducted by researchers from Mondragon University and the University of Seville on OpenAI's new o3-mini LLM as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.