Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction

Abstract

Safety fine-tuning algorithms are commonly used to fine-tune language models to reduce harmful outputs, but the exact internal mechanisms by which these models achieve this remain unclear. For direct preference optimisation (DPO) applied to toxicity reduction, current explanations claim that DPO works by dampening the most toxic MLP neurons to learn an offset that averts toxic regions in the residual stream. However, by ablating the most toxic neurons and applying activation patching, we find this explanation incomplete. By projecting neuron activation changes onto a toxicity probe, we find that only 31.8% of toxicity reduction comes from dampened toxic neurons. Instead, DPO reduces toxicity by accumulating effects across multiple neuron groups, both reducing writing in the toxic direction and promoting anti-toxicity in the residual stream. Moreover, DPO makes noisy adjustments to neuron activations, with many neurons actually increasing toxicity. This indicates that DPO is a balancing process between opposing neuron effects that together achieve toxicity reduction.
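
The projection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, array shapes, group definitions, and toy numbers below are all assumptions chosen for illustration. The sketch assumes each MLP neuron writes a fixed output vector into the residual stream, so a change in its activation shifts the residual stream along the toxicity probe by the activation change multiplied by the alignment between that output vector and the probe direction.

```python
import numpy as np


def neuron_contributions(act_before, act_after, w_out, probe):
    """Per-neuron change in writing along the probe direction (hypothetical helper).

    act_before, act_after: (n_neurons,) mean activations before and after tuning
    w_out: (n_neurons, d_model) output vector of each MLP neuron
    probe: (d_model,) toxicity probe direction
    Negative values mean the neuron now writes less toxicity.
    """
    probe_unit = probe / np.linalg.norm(probe)
    alignment = w_out @ probe_unit  # how toxic-aligned each neuron's output is
    return (act_after - act_before) * alignment


def group_share(contrib, mask):
    """Fraction of the net toxicity reduction attributable to neurons in mask."""
    total_reduction = -contrib.sum()
    return -contrib[mask].sum() / total_reduction


# Toy data (invented, not from the paper): 4 neurons, 2-dimensional residual stream.
probe = np.array([1.0, 0.0])
w_out = np.array([[1.0, 0.0], [-1.0, 0.0], [0.5, 0.0], [1.0, 0.0]])
act_before = np.array([2.0, 1.0, 1.0, 1.0])
act_after = np.array([1.0, 2.0, 1.0, 1.5])

delta = act_after - act_before
alignment = w_out @ (probe / np.linalg.norm(probe))
contrib = neuron_contributions(act_before, act_after, w_out, probe)

# Illustrative groups: toxic-aligned neurons that were dampened, anti-toxic
# neurons that were promoted, and neurons whose change increases toxicity.
dampened_toxic = (alignment > 0) & (delta < 0)
promoted_anti_toxic = (alignment < 0) & (delta > 0)
increases_toxicity = contrib > 0

print("dampened toxic share:", group_share(contrib, dampened_toxic))
print("promoted anti-toxic share:", group_share(contrib, promoted_anti_toxic))
print("toxicity-increasing share:", group_share(contrib, increases_toxicity))
```

In this toy setup the group shares sum to one, and a group whose neurons increase toxicity receives a negative share. That mirrors the balancing picture in the abstract, where the net reduction is the sum of opposing per-neuron effects.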
