Accelerating Direct Preference Optimization with Prefix Sharing

Abstract

Offline paired preference optimization algorithms have become a popular approach for fine-tuning on preference data, outperforming traditional supervised fine-tuning in various tasks. However, traditional implementations often involve redundant computation, especially for tasks with long shared prompts. We introduce prefix sharing for preference tuning, a novel technique that processes the chosen and rejected responses as one sequence with a shared prefix. To prevent cross-response contamination, we use a custom block-sparse attention mask. Our method achieves a 1.1-1.5× improvement in training throughput on popular DPO datasets, without any effect on convergence. When combined with sequence packing, we observe consistent 1.3-1.6× speedups, benefiting even datasets with smaller sequence lengths. While we focus on Direct Preference Optimization (DPO), our approach is applicable to other paired preference tuning methods. By enhancing computational efficiency, our work contributes to making preference-based fine-tuning more accessible for a wider range of applications and model sizes. We open-source our code at https://github.com/frankxwang/dpo-prefix-sharing.
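As a rough illustration of the idea in the abstract, the sketch below builds the kind of attention mask that prefix sharing needs when the sequence is laid out as [prefix | chosen | rejected]. It is a minimal sketch, not the authors' implementation: the function names, the dense boolean mask, and the choice to restart position IDs for the rejected response are all assumptions made here for illustration, and the released code may use a different (block-sparse) kernel.

```python
def prefix_sharing_mask(prefix_len, chosen_len, rejected_len):
    """Return mask[q][k] == True if query token q may attend to key token k.

    Assumed layout (hypothetical): [prefix | chosen | rejected].
    """
    chosen_start = prefix_len
    rejected_start = prefix_len + chosen_len
    total = rejected_start + rejected_len

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        q_in_rejected = q >= rejected_start
        for k in range(q + 1):  # causal: only keys at or before the query
            k_in_chosen = chosen_start <= k < rejected_start
            # Block the rejected response from seeing the chosen response,
            # so each response only conditions on the shared prefix and itself.
            mask[q][k] = not (q_in_rejected and k_in_chosen)
    return mask


def position_ids(prefix_len, chosen_len, rejected_len):
    """Assumption: the rejected response restarts its positions right after
    the prefix, as if the chosen response were not in the sequence."""
    chosen_part = list(range(prefix_len + chosen_len))
    rejected_part = list(range(prefix_len, prefix_len + rejected_len))
    return chosen_part + rejected_part


def tokens_processed(prefix_len, chosen_len, rejected_len):
    """Token counts for separate sequences versus one shared-prefix sequence."""
    separate = 2 * prefix_len + chosen_len + rejected_len
    shared = prefix_len + chosen_len + rejected_len
    return separate, shared


if __name__ == "__main__":
    for row in prefix_sharing_mask(2, 1, 1):
        print("".join("1" if allowed else "0" for allowed in row))
    print(position_ids(2, 1, 1))
    print(tokens_processed(2, 1, 1))
```

With a two-token prefix and one-token responses, the last row of the mask is `1101`: the rejected token attends to the prefix and to itself but not to the chosen token. The token counts show why long shared prompts benefit most, since the prefix is processed once instead of twice.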
