MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Abstract

Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., they select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
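To make the bandit setting mentioned in the abstract concrete, the following is a minimal illustrative sketch and not the paper's algorithm. It assumes Gaussian arms with known noise variance and a conjugate Gaussian posterior per arm, scores each arm by its posterior mean plus a weighted information-gain bonus, and samples arms with a Boltzmann (softmax) policy. The names `run_bandit`, `bonus_weight`, and `temperature`, and all default values, are hypothetical choices made for this example.

```python
import numpy as np


def info_gain(post_var, noise_var):
    """Information gain (in nats) about a Gaussian arm's mean from one more noisy pull."""
    return 0.5 * np.log1p(post_var / noise_var)


def boltzmann_probs(scores, temperature):
    """Softmax over scores; the max is subtracted for numerical stability."""
    z = (scores - np.max(scores)) / temperature
    w = np.exp(z)
    return w / w.sum()


def run_bandit(true_means, steps=2000, noise_var=1.0, prior_var=1.0,
               bonus_weight=1.0, temperature=0.1, seed=0):
    """Run Boltzmann exploration with an information-gain bonus on a Gaussian bandit.

    Returns the pull count per arm and the cumulative (pseudo-)regret.
    """
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    k = len(true_means)
    post_mean = np.zeros(k)            # prior mean 0 for every arm (assumption)
    post_var = np.full(k, prior_var)   # prior variance per arm (assumption)
    pulls = np.zeros(k, dtype=int)
    regret = 0.0
    for _ in range(steps):
        # Extrinsic estimate plus weighted intrinsic bonus.
        scores = post_mean + bonus_weight * info_gain(post_var, noise_var)
        arm = rng.choice(k, p=boltzmann_probs(scores, temperature))
        reward = true_means[arm] + rng.normal(0.0, np.sqrt(noise_var))
        # Conjugate Gaussian update of the chosen arm's posterior.
        precision = 1.0 / post_var[arm] + 1.0 / noise_var
        post_mean[arm] = (post_mean[arm] / post_var[arm] + reward / noise_var) / precision
        post_var[arm] = 1.0 / precision
        pulls[arm] += 1
        regret += true_means.max() - true_means[arm]
    return pulls, regret
```

Calling `run_bandit([0.0, 0.5, 1.0])` returns the per-arm pull counts and the accumulated regret. Because the bonus shrinks as an arm's posterior variance decreases, the policy shifts from information seeking towards reward seeking over time. This toy version carries no regret guarantee.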

