Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
October 23, 2024
Authors: Max Wilcoxson, Qiyang Li, Kevin Frans, Sergey Levine
cs.AI
Abstract
Unsupervised pretraining has been transformative in many supervised domains.
However, applying such ideas to reinforcement learning (RL) presents a unique
challenge in that fine-tuning does not involve mimicking task-specific data,
but rather exploring and locating the solution through iterative
self-improvement. In this work, we study how unlabeled prior trajectory data
can be leveraged to learn efficient exploration strategies. While prior data
can be used to pretrain a set of low-level skills, or as additional off-policy
data for online RL, it has been unclear how to combine these ideas effectively
for online exploration. Our method SUPE (Skills from Unlabeled Prior data for
Exploration) demonstrates that a careful combination of these ideas compounds
their benefits. Our method first extracts low-level skills using a variational
autoencoder (VAE), and then pseudo-relabels unlabeled trajectories using an
optimistic reward model, transforming prior data into high-level, task-relevant
examples. Finally, SUPE uses these transformed examples as additional
off-policy data for online RL to learn a high-level policy that composes
pretrained low-level skills to explore efficiently. We empirically show that
SUPE reliably outperforms prior strategies, successfully solving a suite of
long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.
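To make the relabeling step concrete, below is a minimal, illustrative Python/NumPy sketch of how unlabeled trajectory segments could be turned into high-level off-policy transitions: each segment is encoded into a skill latent (the high-level action) and labeled with an optimistic reward. All names here (`encode_skill`, `optimistic_reward`, `relabel_trajectory`, the fixed horizon `H`, and the ensemble-mean-plus-bonus form of optimism) are assumptions for illustration, not the SUPE codebase's API.

```python
import numpy as np

H = 8  # assumed skill horizon: low-level steps per high-level action

def encode_skill(obs_segment, act_segment):
    """Stand-in for the trained VAE encoder q(z | trajectory segment).
    Returns a fixed-size latent; a real encoder would be learned."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(4)

def optimistic_reward(obs, act, ensemble):
    """One common way to instantiate an 'optimistic reward model':
    ensemble mean plus an uncertainty bonus (std across members)."""
    preds = np.array([r(obs, act) for r in ensemble])
    return preds.mean() + preds.std()  # bonus encourages exploration

def relabel_trajectory(obs, acts, ensemble):
    """Convert one unlabeled trajectory into high-level transitions
    (s_t, z_t, R_t, s_{t+H}) usable as off-policy data."""
    transitions = []
    T = len(acts)
    for t in range(0, T - H, H):
        seg_obs, seg_acts = obs[t:t + H], acts[t:t + H]
        z = encode_skill(seg_obs, seg_acts)          # high-level action
        R = sum(optimistic_reward(o, a, ensemble)    # pseudo-labeled
                for o, a in zip(seg_obs, seg_acts))  # segment return
        transitions.append((obs[t], z, R, obs[t + H]))
    return transitions
```

Under this reading, the resulting transitions would be appended to the high-level agent's replay buffer alongside online experience, so exploration composes pretrained skills from the start rather than learning from scratch.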