Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
October 23, 2024
Authors: Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville
cs.AI
Abstract
The dominant paradigm for RLHF is online and on-policy RL: synchronously
generating from the large language model (LLM) policy, labelling with a reward
model, and learning using feedback on the LLM's own outputs. While performant,
this paradigm is computationally inefficient. Inspired by classical deep RL
literature, we propose separating generation and learning in RLHF. This enables
asynchronous generation of new samples while simultaneously training on old
samples, leading to faster training and more compute-optimal scaling. However,
asynchronous training relies on an underexplored regime, online but off-policy
RLHF: learning on samples from previous iterations of our model. To understand
the challenges in this regime, we investigate a fundamental question: how much
off-policyness can we tolerate for asynchronous training to speed up learning
but maintain performance? Among several RLHF algorithms we tested, we find that
online DPO is most robust to off-policy data, and robustness increases with the
scale of the policy model. We study further compute optimizations for
asynchronous RLHF but find that they come at a performance cost, giving rise to
a trade-off. Finally, we verify the scalability of asynchronous RLHF by
training LLaMA 3.1 8B on an instruction-following task 40% faster than a
synchronous run while matching final performance.
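To make the decoupling of generation and learning concrete, below is a minimal, self-contained sketch (not the paper's implementation) of asynchronous RLHF: a generation worker keeps sampling from a snapshot of the latest policy weights while the trainer updates the policy on batches produced with the previous iteration's weights, i.e. online but one-step off-policy. The functions `generate_samples`, `label_with_reward_model`, and `update_policy` are hypothetical placeholders standing in for an actual LLM policy, reward model, and RLHF update such as online DPO.

```python
import copy
import queue
import threading

# Holds at most one labelled batch, so generation stays at most one iteration
# ahead of training: the trainer learns on samples from a slightly stale policy.
sample_queue = queue.Queue(maxsize=1)

def generate_samples(policy, num_samples):
    # Hypothetical placeholder: sample completions from the (possibly stale) policy.
    return [f"completion-v{policy['version']}-{i}" for i in range(num_samples)]

def label_with_reward_model(samples):
    # Hypothetical placeholder: score each completion with a reward model.
    return [(sample, float(len(sample))) for sample in samples]

def update_policy(policy, labelled_batch):
    # Hypothetical placeholder: one optimisation step (e.g. online DPO) on the batch.
    policy["version"] += 1
    return policy

def generation_worker(get_policy_snapshot, num_iterations, batch_size):
    for _ in range(num_iterations):
        stale_policy = get_policy_snapshot()                # copy of the latest weights
        samples = generate_samples(stale_policy, batch_size)
        sample_queue.put(label_with_reward_model(samples))  # blocks if trainer falls behind

def train_async(num_iterations=4, batch_size=8):
    policy = {"version": 0}
    lock = threading.Lock()

    def get_policy_snapshot():
        with lock:
            return copy.deepcopy(policy)

    producer = threading.Thread(
        target=generation_worker,
        args=(get_policy_snapshot, num_iterations, batch_size),
    )
    producer.start()
    for _ in range(num_iterations):
        batch = sample_queue.get()   # generated with weights at most one update old
        with lock:
            policy = update_policy(policy, batch)
    producer.join()
    return policy

if __name__ == "__main__":
    print(train_async())  # {'version': 4}
```

In a synchronous setup, the generation step would block the update step (and vice versa); here the two run concurrently, which is the source of the speedup the abstract reports, at the cost of learning from off-policy samples.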