Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests
February 8, 2025
Authors: David Noever, Forrest McKee
cs.AI
Abstract
The development of robust safety benchmarks for large language models
requires open, reproducible datasets that can measure both appropriate refusal
of harmful content and potential over-restriction of legitimate scientific
discourse. We present an open-source dataset and testing framework for
evaluating LLM safety mechanisms across primarily controlled-substance queries,
analyzing four major models' responses to systematically varied prompts. Our
results reveal distinct safety profiles: Claude-3.5-sonnet demonstrated the
most conservative approach with 73% refusals and 27% allowances, while Mistral
attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction
with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and
80% allowances. Testing prompt variation strategies revealed decreasing
response consistency, from 85% with single prompts to 65% with five variations.
This publicly available benchmark enables systematic evaluation of the critical
balance between necessary safety restrictions and potential over-censorship of
legitimate scientific inquiry, while providing a foundation for measuring
progress in AI safety implementation. Chain-of-thought analysis reveals
potential vulnerabilities in safety mechanisms, highlighting the complexity of
implementing robust safeguards without unduly restricting desirable and valid
scientific discourse.
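The abstract reports per-model refusal/allowance rates and a response-consistency figure across prompt variations. The sketch below is a minimal, hypothetical illustration of how such metrics could be computed from labeled model responses; it is not the authors' released framework, and the names `RESPONSES`, `refusal_rate`, and `consistency` are assumptions introduced here for clarity.

```python
# Hypothetical sketch of the two reported metrics: overall refusal rate and
# cross-variation consistency. Response labels are assumed to be already
# classified as "refusal" or "allowance" for one model.

from collections import Counter

# Hypothetical data: for each query, one label per prompt variation.
RESPONSES = {
    "query_001": ["refusal", "refusal", "allowance"],
    "query_002": ["allowance", "allowance", "allowance"],
    "query_003": ["refusal", "refusal", "refusal"],
}

def refusal_rate(responses):
    """Fraction of all responses labeled as refusals (cf. Claude-3.5-sonnet's 73%)."""
    labels = [label for variants in responses.values() for label in variants]
    return Counter(labels)["refusal"] / len(labels)

def consistency(responses):
    """Fraction of queries whose prompt variations all receive the same label
    (cf. the reported drop from 85% with one prompt to 65% with five variations)."""
    consistent = sum(1 for variants in responses.values() if len(set(variants)) == 1)
    return consistent / len(responses)

if __name__ == "__main__":
    print(f"refusal rate: {refusal_rate(RESPONSES):.0%}")
    print(f"consistency:  {consistency(RESPONSES):.0%}")
```

Under these assumptions, adding more prompt variations per query can only hold or lower the consistency score, which matches the trend the abstract describes.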