
MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

December 16, 2024
Authors: Bhavya Sukhija, Stelian Coros, Andreas Krause, Pieter Abbeel, Carmelo Sferrazza
cs.AI

Abstract

Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions, by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
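A minimal sketch of the core idea described above, under assumptions of my own: the information gain is approximated by the disagreement of a learned dynamics ensemble, and the intrinsic bonus is mixed into the task reward with a hand-set coefficient beta before Boltzmann (softmax) exploration over Q-values. These choices are illustrative only; the paper's contribution is to balance the extrinsic and intrinsic terms automatically rather than tuning such a coefficient per task.

```python
# Illustrative sketch, not the authors' implementation.
import numpy as np

def information_gain_bonus(ensemble_predictions):
    """Approximate information gain by the disagreement (variance) of an
    ensemble of dynamics models at a given transition.
    ensemble_predictions: array of shape (n_models, state_dim)."""
    return float(np.mean(np.var(ensemble_predictions, axis=0)))

def augmented_reward(extrinsic_reward, ensemble_predictions, beta=1.0):
    """Mix the task (extrinsic) reward with the intrinsic exploration bonus.
    beta is a hypothetical trade-off coefficient used here for illustration."""
    return extrinsic_reward + beta * information_gain_bonus(ensemble_predictions)

def boltzmann_action_probs(q_values, temperature=1.0):
    """Boltzmann exploration: a softmax over Q-values whose temperature
    controls the entropy of the action distribution."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage with made-up numbers:
preds = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4]])   # 3 models, 2-D next state
r_aug = augmented_reward(extrinsic_reward=1.0, ensemble_predictions=preds, beta=0.5)
probs = boltzmann_action_probs(q_values=[r_aug, 0.8, 0.2], temperature=0.5)
```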
