Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
October 23, 2024
Authors: Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville
cs.AI
Abstract
The dominant paradigm for RLHF is online and on-policy RL: synchronously
generating from the large language model (LLM) policy, labelling with a reward
model, and learning using feedback on the LLM's own outputs. While performant,
this paradigm is computationally inefficient. Inspired by classical deep RL
literature, we propose separating generation and learning in RLHF. This enables
asynchronous generation of new samples while simultaneously training on old
samples, leading to faster training and more compute-optimal scaling. However,
asynchronous training relies on an underexplored regime, online but off-policy
RLHF: learning on samples from previous iterations of our model. To understand
the challenges in this regime, we investigate a fundamental question: how much
off-policyness can we tolerate for asynchronous training to speed up learning
but maintain performance? Among several RLHF algorithms we tested, we find that
online DPO is most robust to off-policy data, and robustness increases with the
scale of the policy model. We study further compute optimizations for
asynchronous RLHF but find that they come at a performance cost, giving rise to
a trade-off. Finally, we verify the scalability of asynchronous RLHF by
training LLaMA 3.1 8B on an instruction-following task 40% faster than a
synchronous run while matching final performance.
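To make the decoupling of generation and learning concrete, below is a minimal, self-contained sketch (not the paper's implementation) of asynchronous RLHF: a generation worker keeps sampling from a snapshot of the latest policy weights while the trainer updates the policy on batches produced with the previous iteration's weights, i.e. online but one-step off-policy. The functions `generate_samples`, `label_with_reward_model`, and `update_policy` are hypothetical placeholders standing in for an actual LLM policy, reward model, and RLHF update such as online DPO.

```python
import copy
import queue
import threading

# Holds at most one labelled batch, so generation stays at most one iteration
# ahead of training: the trainer learns on samples from a slightly stale policy.
sample_queue = queue.Queue(maxsize=1)

def generate_samples(policy, num_samples):
    # Hypothetical placeholder: sample completions from the (possibly stale) policy.
    return [f"completion-v{policy['version']}-{i}" for i in range(num_samples)]

def label_with_reward_model(samples):
    # Hypothetical placeholder: score each completion with a reward model.
    return [(sample, float(len(sample))) for sample in samples]

def update_policy(policy, labelled_batch):
    # Hypothetical placeholder: one optimisation step (e.g. online DPO) on the batch.
    policy["version"] += 1
    return policy

def generation_worker(get_policy_snapshot, num_iterations, batch_size):
    for _ in range(num_iterations):
        stale_policy = get_policy_snapshot()                # copy of the latest weights
        samples = generate_samples(stale_policy, batch_size)
        sample_queue.put(label_with_reward_model(samples))  # blocks if trainer falls behind

def train_async(num_iterations=4, batch_size=8):
    policy = {"version": 0}
    lock = threading.Lock()

    def get_policy_snapshot():
        with lock:
            return copy.deepcopy(policy)

    producer = threading.Thread(
        target=generation_worker,
        args=(get_policy_snapshot, num_iterations, batch_size),
    )
    producer.start()
    for _ in range(num_iterations):
        batch = sample_queue.get()   # generated with weights at most one update old
        with lock:
            policy = update_policy(policy, batch)
    producer.join()
    return policy

if __name__ == "__main__":
    print(train_async())  # {'version': 4}
```

In a synchronous setup, the generation step would block the update step (and vice versa); here the two run concurrently, which is the source of the speedup the abstract reports, at the cost of learning from off-policy samples.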