

Value-Based Deep RL Scales Predictably

February 6, 2025
Authors: Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar
cs.AI

Abstract

Scaling data and compute is critical to the success of machine learning. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which is used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
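The abstract's central idea, that the data needed to reach a target return and the compute spent to get there trade off along a predictable frontier indexed by the updates-to-data (UTD) ratio, can be illustrated with a small curve-fitting sketch. The power-law form `data_to_threshold`, the sample UTD ratios, and the step counts below are illustrative assumptions, not the paper's fitted model or measurements; they only show how a handful of small-scale runs could be extrapolated to larger UTD ratios.

```python
# Minimal sketch: fit how the data requirement for a fixed target return
# shrinks as the updates-to-data (UTD) ratio grows, then extrapolate.
# Functional form and all numbers are hypothetical placeholders.
import numpy as np
from scipy.optimize import curve_fit

def data_to_threshold(utd, d_min, alpha, beta):
    """Assumed form: data needed approaches an asymptote d_min as UTD grows."""
    return d_min + alpha * utd ** (-beta)

# Small-scale measurements (made-up): UTD ratios swept cheaply, and the
# number of environment steps each run took to reach the target return.
utd_ratios = np.array([1.0, 2.0, 4.0, 8.0])
data_needed = np.array([2.0e6, 1.3e6, 0.9e6, 0.7e6])

params, _ = curve_fit(data_to_threshold, utd_ratios, data_needed,
                      p0=[5e5, 1.5e6, 1.0])

# Extrapolate along the frontier: predicted data requirement at larger UTD,
# with compute proxied as gradient updates = UTD * environment steps.
for utd in (16, 32):
    d = data_to_threshold(utd, *params)
    print(f"UTD={utd}: ~{d:.2e} env steps, ~{utd * d:.2e} gradient updates")
```

Under this kind of fit, picking the UTD ratio amounts to choosing a point on the data-compute Pareto frontier, which is the quantity the paper reports as the lever for allocating a total budget.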
