

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

October 23, 2024
Authors: Max Wilcoxson, Qiyang Li, Kevin Frans, Sergey Levine
cs.AI

Abstract

Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-relabels unlabeled trajectories using an optimistic reward model, transforming prior data into high-level, task-relevant examples. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.
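To make the pipeline described in the abstract concrete, below is a minimal, self-contained sketch of the relabeling step: prior trajectories are chunked into segments, each segment is mapped to a latent skill by a stand-in for the pretrained VAE encoder, and an optimistic pseudo-reward is attached so the segments can be replayed as high-level off-policy data. This is not the authors' released implementation (see the repository linked above); `encode_skill`, `EnsembleRewardModel`, the ensemble-disagreement optimism bonus, and all dimensions are illustrative assumptions rather than the paper's exact design.

```python
"""Illustrative sketch of SUPE-style relabeling of unlabeled prior trajectories
into high-level, optimistically pseudo-labeled transitions. All components are
placeholders chosen for clarity, not the authors' implementation."""
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, SKILL_DIM, SEGMENT_LEN = 4, 2, 10  # toy sizes (assumed, not from the paper)


def encode_skill(segment: np.ndarray) -> np.ndarray:
    """Stand-in for the pretrained VAE skill encoder: map a (T, obs_dim) segment
    to a latent skill vector. A real implementation would use the trained VAE's
    posterior mean; here a fixed projection is used purely for illustration."""
    proj = np.ones((segment.shape[1], SKILL_DIM)) / segment.shape[1]
    return segment.mean(axis=0) @ proj


class EnsembleRewardModel:
    """Toy ensemble of linear reward heads. The optimistic pseudo-label is the
    mean prediction plus a disagreement bonus (UCB-style optimism); the paper's
    reward model is learned and updated during online RL."""

    def __init__(self, n_heads: int = 5, bonus_coef: float = 1.0):
        self.heads = rng.normal(size=(n_heads, OBS_DIM))
        self.bonus_coef = bonus_coef

    def optimistic_reward(self, obs: np.ndarray) -> float:
        preds = self.heads @ obs
        return float(preds.mean() + self.bonus_coef * preds.std())


def relabel_trajectory(traj_obs: np.ndarray, reward_model: EnsembleRewardModel):
    """Split a trajectory into SEGMENT_LEN chunks and emit high-level
    transitions (s_t, z_t, r_t, s_{t+k}) usable as off-policy data for the
    high-level policy that composes pretrained skills."""
    transitions = []
    for start in range(0, len(traj_obs) - SEGMENT_LEN, SEGMENT_LEN):
        segment = traj_obs[start:start + SEGMENT_LEN]
        z = encode_skill(segment)                        # latent skill = high-level "action"
        r = reward_model.optimistic_reward(segment[-1])  # optimistic pseudo-reward
        transitions.append((segment[0], z, r, traj_obs[start + SEGMENT_LEN]))
    return transitions


# Fake unlabeled prior data: 3 trajectories of 50 observations each.
prior_trajs = [rng.normal(size=(50, OBS_DIM)) for _ in range(3)]
reward_model = EnsembleRewardModel()
replay = [t for traj in prior_trajs for t in relabel_trajectory(traj, reward_model)]
print(f"{len(replay)} high-level transitions ready for off-policy RL")
```

In the actual method, both the skill encoder and the reward model are learned (the reward model is trained and relabeling is refreshed as online data arrives); the fixed placeholders above are only meant to show how unlabeled trajectories become high-level off-policy examples.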

