MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization
December 16, 2024
Authors: Bhavya Sukhija, Stelian Coros, Andreas Krause, Pieter Abbeel, Carmelo Sferrazza
cs.AI
Abstract
Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions, by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
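As a rough illustration of the idea described in the abstract, here is a minimal sketch (not the authors' implementation) of how an intrinsic information-gain bonus can be combined with Boltzmann exploration: the disagreement of an ensemble of dynamics models stands in for the information gain, and action probabilities come from a softmax over task Q-values augmented by that bonus. The ensemble-disagreement proxy, the function names (`ensemble_disagreement`, `boltzmann_policy`), the temperature and weighting parameters, and the use of a small discrete action set are all illustrative assumptions; the paper's algorithms target continuous state-action spaces with off-policy model-free RL.

```python
# Minimal sketch, not the authors' implementation: an ensemble-disagreement
# proxy for information gain combined with Boltzmann exploration.
import numpy as np

rng = np.random.default_rng(0)

def ensemble_disagreement(models, state, actions):
    """Proxy for the information gain of each candidate action: variance of
    next-state predictions across an ensemble of dynamics models."""
    preds = np.stack([m(state, actions) for m in models])   # (n_models, n_actions, state_dim)
    return preds.var(axis=0).sum(axis=-1)                   # (n_actions,)

def boltzmann_policy(q_values, info_gain, temp_extrinsic=1.0, weight_intrinsic=1.0):
    """Softmax over task Q-values augmented with the intrinsic bonus; a larger
    weight_intrinsic biases the policy towards informative transitions."""
    logits = q_values / temp_extrinsic + weight_intrinsic * info_gain
    logits -= logits.max()                                   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage: 4 discrete candidate actions and a 3-member ensemble of random
# linear "dynamics models" (stand-ins for learned models).
state = rng.normal(size=2)
actions = np.eye(4)
models = [
    (lambda s, a, W=rng.normal(size=(6, 2)):
        np.concatenate([np.tile(s, (len(a), 1)), a], axis=1) @ W)
    for _ in range(3)
]
q_values = rng.normal(size=4)                                # extrinsic task values
probs = boltzmann_policy(q_values, ensemble_disagreement(models, state, actions))
action = rng.choice(len(probs), p=probs)
```

Setting `weight_intrinsic` to zero recovers plain Boltzmann exploration over the task value alone, which makes the trade-off between extrinsic and intrinsic objectives explicit in a single tunable knob.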