

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

October 23, 2024
Authors: Max Wilcoxson, Qiyang Li, Kevin Frans, Sergey Levine
cs.AI

Abstract

Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-relabels unlabeled trajectories using an optimistic reward model, transforming prior data into high-level, task-relevant examples. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.
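To make the pipeline described in the abstract concrete, below is a minimal, self-contained sketch of the relabeling step: prior trajectories are chunked into segments, each segment is mapped to a latent skill by a stand-in for the pretrained VAE encoder, and an optimistic pseudo-reward is attached so the segments can be replayed as high-level off-policy data. This is not the authors' released implementation (see the repository linked above); `encode_skill`, `EnsembleRewardModel`, the ensemble-disagreement optimism bonus, and all dimensions are illustrative assumptions rather than the paper's exact design.

```python
"""Illustrative sketch of SUPE-style relabeling of unlabeled prior trajectories
into high-level, optimistically pseudo-labeled transitions. All components are
placeholders chosen for clarity, not the authors' implementation."""
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, SKILL_DIM, SEGMENT_LEN = 4, 2, 10  # toy sizes (assumed, not from the paper)


def encode_skill(segment: np.ndarray) -> np.ndarray:
    """Stand-in for the pretrained VAE skill encoder: map a (T, obs_dim) segment
    to a latent skill vector. A real implementation would use the trained VAE's
    posterior mean; here a fixed projection is used purely for illustration."""
    proj = np.ones((segment.shape[1], SKILL_DIM)) / segment.shape[1]
    return segment.mean(axis=0) @ proj


class EnsembleRewardModel:
    """Toy ensemble of linear reward heads. The optimistic pseudo-label is the
    mean prediction plus a disagreement bonus (UCB-style optimism); the paper's
    reward model is learned and updated during online RL."""

    def __init__(self, n_heads: int = 5, bonus_coef: float = 1.0):
        self.heads = rng.normal(size=(n_heads, OBS_DIM))
        self.bonus_coef = bonus_coef

    def optimistic_reward(self, obs: np.ndarray) -> float:
        preds = self.heads @ obs
        return float(preds.mean() + self.bonus_coef * preds.std())


def relabel_trajectory(traj_obs: np.ndarray, reward_model: EnsembleRewardModel):
    """Split a trajectory into SEGMENT_LEN chunks and emit high-level
    transitions (s_t, z_t, r_t, s_{t+k}) usable as off-policy data for the
    high-level policy that composes pretrained skills."""
    transitions = []
    for start in range(0, len(traj_obs) - SEGMENT_LEN, SEGMENT_LEN):
        segment = traj_obs[start:start + SEGMENT_LEN]
        z = encode_skill(segment)                        # latent skill = high-level "action"
        r = reward_model.optimistic_reward(segment[-1])  # optimistic pseudo-reward
        transitions.append((segment[0], z, r, traj_obs[start + SEGMENT_LEN]))
    return transitions


# Fake unlabeled prior data: 3 trajectories of 50 observations each.
prior_trajs = [rng.normal(size=(50, OBS_DIM)) for _ in range(3)]
reward_model = EnsembleRewardModel()
replay = [t for traj in prior_trajs for t in relabel_trajectory(traj, reward_model)]
print(f"{len(replay)} high-level transitions ready for off-policy RL")
```

In the actual method, both the skill encoder and the reward model are learned (the reward model is trained and relabeling is refreshed as online data arrives); the fixed placeholders above are only meant to show how unlabeled trajectories become high-level off-policy examples.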

