Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
October 23, 2024
Authors: Max Wilcoxson, Qiyang Li, Kevin Frans, Sergey Levine
cs.AI
Abstract
Unsupervised pretraining has been transformative in many supervised domains.
However, applying such ideas to reinforcement learning (RL) presents a unique
challenge in that fine-tuning does not involve mimicking task-specific data,
but rather exploring and locating the solution through iterative
self-improvement. In this work, we study how unlabeled prior trajectory data
can be leveraged to learn efficient exploration strategies. While prior data
can be used to pretrain a set of low-level skills, or as additional off-policy
data for online RL, it has been unclear how to combine these ideas effectively
for online exploration. Our method SUPE (Skills from Unlabeled Prior data for
Exploration) demonstrates that a careful combination of these ideas compounds
their benefits. Our method first extracts low-level skills using a variational
autoencoder (VAE), and then pseudo-relabels unlabeled trajectories using an
optimistic reward model, transforming prior data into high-level, task-relevant
examples. Finally, SUPE uses these transformed examples as additional
off-policy data for online RL to learn a high-level policy that composes
pretrained low-level skills to explore efficiently. We empirically show that
SUPE reliably outperforms prior strategies, successfully solving a suite of
long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.
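To make the relabeling step concrete, below is a minimal, illustrative Python/NumPy sketch of how unlabeled trajectory segments could be turned into high-level off-policy transitions: each segment is encoded into a skill latent (the high-level action) and labeled with an optimistic reward. All names here (`encode_skill`, `optimistic_reward`, `relabel_trajectory`, the fixed horizon `H`, and the ensemble-mean-plus-bonus form of optimism) are assumptions for illustration, not the SUPE codebase's API.

```python
import numpy as np

H = 8  # assumed skill horizon: low-level steps per high-level action

def encode_skill(obs_segment, act_segment):
    """Stand-in for the trained VAE encoder q(z | trajectory segment).
    Returns a fixed-size latent; a real encoder would be learned."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(4)

def optimistic_reward(obs, act, ensemble):
    """One common way to instantiate an 'optimistic reward model':
    ensemble mean plus an uncertainty bonus (std across members)."""
    preds = np.array([r(obs, act) for r in ensemble])
    return preds.mean() + preds.std()  # bonus encourages exploration

def relabel_trajectory(obs, acts, ensemble):
    """Convert one unlabeled trajectory into high-level transitions
    (s_t, z_t, R_t, s_{t+H}) usable as off-policy data."""
    transitions = []
    T = len(acts)
    for t in range(0, T - H, H):
        seg_obs, seg_acts = obs[t:t + H], acts[t:t + H]
        z = encode_skill(seg_obs, seg_acts)          # high-level action
        R = sum(optimistic_reward(o, a, ensemble)    # pseudo-labeled
                for o, a in zip(seg_obs, seg_acts))  # segment return
        transitions.append((obs[t], z, R, obs[t + H]))
    return transitions
```

Under this reading, the resulting transitions would be appended to the high-level agent's replay buffer alongside online experience, so exploration composes pretrained skills from the start rather than learning from scratch.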