Value-Based Deep RL Scales Predictably

Abstract

Scaling data and compute is critical to the success of machine learning. However, scaling demands predictability: we want methods not only to perform well with more compute or data, but also to have performance that can be predicted from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that the data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict the data requirement when given more compute, and the compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance level and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which are used to manage the effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms (SAC, BRO, and PQL) on DeepMind Control, OpenAI Gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
