

Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction

November 10, 2024
Authors: Yushi Yang, Filip Sondej, Harry Mayne, Adam Mahdi
cs.AI

Abstract

Safety fine-tuning algorithms are commonly used to fine-tune language models to reduce harmful outputs, but the exact internal mechanisms by which these models achieve this remain unclear. In studying direct preference optimisation (DPO) for toxicity reduction, current explanations claim that DPO works by dampening the most toxic MLP neurons to learn an offset that averts toxic regions in the residual stream. However, by ablating the most toxic neurons and applying activation patching, we find this explanation incomplete. By projecting neuron activation changes onto a toxicity probe, we find that only 31.8% of toxicity reduction comes from dampened toxic neurons. Instead, DPO reduces toxicity by accumulating effects across multiple neuron groups, both reducing writing in the toxic direction and promoting anti-toxicity in the residual stream. Moreover, DPO gives noisy adjustments to neuron activations, with many neurons actually increasing toxicity. This indicates that DPO is a balancing process between opposing neuron effects to achieve toxicity reduction.
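The decomposition the abstract describes can be pictured as projecting each MLP neuron's write vector onto a linear toxicity probe and weighting it by the change in that neuron's activation after DPO. Below is a minimal PyTorch sketch of that idea; the tensor names, shapes, single-layer view, and random data are illustrative assumptions for exposition, not the authors' released code.

```python
import torch

def toxic_writing_change(acts_base, acts_dpo, W_out, probe_dir):
    """Per-neuron change in writing along a toxicity probe direction.

    acts_base, acts_dpo: (n_neurons,) mean MLP activations before/after DPO
    W_out:               (n_neurons, d_model) MLP output weights
    probe_dir:           (d_model,) unit vector of a linear toxicity probe

    Illustrative sketch only; names and the single-layer view are assumptions.
    """
    # How strongly each neuron's output vector points in the toxic direction.
    alignment = W_out @ probe_dir                  # (n_neurons,)
    # Change in each neuron's contribution along that direction after DPO.
    delta = (acts_dpo - acts_base) * alignment     # (n_neurons,)
    return delta

# Toy example with random tensors standing in for a real model.
torch.manual_seed(0)
n_neurons, d_model = 3072, 768
W_out = torch.randn(n_neurons, d_model) / d_model ** 0.5
probe_dir = torch.nn.functional.normalize(torch.randn(d_model), dim=0)
acts_base = torch.randn(n_neurons)
acts_dpo = acts_base + 0.05 * torch.randn(n_neurons)

delta = toxic_writing_change(acts_base, acts_dpo, W_out, probe_dir)
print("net change along toxic direction:", delta.sum().item())
print("fraction of neurons increasing toxic writing:", (delta > 0).float().mean().item())
```

Summing `delta` over neuron groups is one way to attribute how much of the overall toxicity reduction comes from dampened toxic neurons versus other neurons writing in the anti-toxic direction, which is the kind of accounting the 31.8% figure refers to.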
