LookAhead Tuning: Safer Language Models via Partial Answer Previews

Abstract

Fine-tuning enables large language models (LLMs) to adapt to specific domains, but it often undermines their previously established safety alignment. To mitigate this degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to the initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at https://github.com/zjunlp/LookAheadTuning.
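
The abstract does not spell out how the training data is modified, so the following is only a minimal sketch of the general idea of previewing a partial answer prefix. It assumes the preview is appended to the instruction while the target answer is left unchanged. The function name `preview_example`, the template wording, the whitespace tokenization, and the parameter `m` are all hypothetical choices made for illustration and are not taken from the released code.

```python
def preview_example(instruction: str, answer: str, m: int = 6) -> dict:
    """Build one training pair whose instruction previews the start of the answer.

    Assumptions (not from the paper): whitespace tokenization, a fixed
    English template, and a preview length of m tokens.
    """
    # Take the first m whitespace-separated tokens of the reference answer.
    prefix = " ".join(answer.split()[:m])

    # Append the preview to the input side only; the target stays as-is,
    # so the tokens the model must produce first are already shown to it.
    previewed = f"{instruction}\nThe answer begins with: {prefix}"
    return {"instruction": previewed, "output": answer}


if __name__ == "__main__":
    pair = preview_example("Solve 2+3.", "The sum of 2 and 3 is 5 .", m=3)
    print(pair["instruction"])
    print(pair["output"])
```

Applied to every example in a fine-tuning set, a transformation of this kind changes only the data, which is consistent with the abstract's description of the methods as data-driven and low-resource.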
