Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction

November 10, 2024
Authors: Yushi Yang, Filip Sondej, Harry Mayne, Adam Mahdi
cs.AI

Abstract

Safety fine-tuning algorithms are commonly used to fine-tune language models to reduce harmful outputs, but the exact internal mechanisms of how those models achieve this remain unclear. In studying direct preference optimisation (DPO) for toxicity reduction, current explanations claim that DPO works by dampening the most toxic MLP neurons to learn an offset to avert toxic regions in the residual stream. However, by ablating the most toxic neurons and applying activation patching, we find this explanation incomplete. By projecting neuron activation changes onto a toxicity probe, we find that only 31.8% of toxicity reduction comes from dampened toxic neurons. Instead, DPO reduces toxicity by accumulating effects across multiple neuron groups, both reducing writing in the toxic direction and promoting anti-toxicity in the residual stream. Moreover, DPO gives noisy adjustments to neuron activations, with many neurons actually increasing toxicity. This indicates that DPO is a balancing process between opposing neuron effects to achieve toxicity reduction.
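
To make the projection analysis concrete, below is a minimal sketch (not the authors' released code) of how changes in MLP neuron activations can be projected onto a linear toxicity probe to attribute toxicity reduction to individual neurons or neuron groups. Tensor names, shapes, and the helper functions are illustrative assumptions.

```python
import torch

def toxicity_contributions(acts_base, acts_dpo, w_out, probe):
    """Per-neuron contribution to movement along a toxicity probe direction.

    acts_base, acts_dpo : (n_tokens, n_neurons) MLP neuron activations on the
                          same prompts before and after DPO fine-tuning.
    w_out               : (n_neurons, d_model) MLP output weights, i.e. the
                          direction each neuron writes into the residual stream.
    probe               : (d_model,) unit vector of a linear toxicity probe.
    """
    delta = acts_dpo - acts_base              # change in each neuron's activation
    write_align = w_out @ probe               # (n_neurons,) alignment of each neuron's
                                              # write direction with the toxic direction
    contrib = delta.mean(dim=0) * write_align # per-neuron shift along the probe
    return contrib                            # negative values = reduced toxicity

def fraction_from_group(contrib, group_idx):
    """Share of the total shift along the probe explained by one neuron group,
    e.g. the most-toxic neurons identified before fine-tuning."""
    return (contrib[group_idx].sum() / contrib.sum()).item()
```

Under this kind of accounting, the share attributable to the previously identified most-toxic neurons corresponds to the roughly 31.8% figure quoted in the abstract, with the remainder spread across other neuron groups that either write less in the toxic direction or write more in an anti-toxic direction.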
