Trading Inference-Time Compute for Adversarial Robustness

Abstract

We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.
