RL Zero: Zero-Shot Language to Behaviors without any Supervision
December 7, 2024
Authors: Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum
cs.AI
Abstract
Rewards remain an uninterpretable way to specify tasks for Reinforcement
Learning, as humans are often unable to predict the optimal behavior of any
given reward function, leading to poor reward design and reward hacking.
Language presents an appealing way to communicate intent to agents and bypass
reward design, but prior efforts to do so have been limited by costly and
unscalable labeling efforts. In this work, we propose a completely
unsupervised method for grounding language instructions into policies in a
zero-shot manner. We present a solution that takes the form of
imagine, project, and imitate: The agent imagines the observation sequence
corresponding to the language description of a task, projects the imagined
sequence to our target domain, and grounds it to a policy. Video-language
models allow us to imagine task descriptions that leverage knowledge of tasks
learned from internet-scale video-text mappings. The challenge remains to
ground these generations to a policy. In this work, we show that we can achieve
a zero-shot language-to-behavior policy by first grounding the imagined
sequences in real observations of an unsupervised RL agent and using a
closed-form solution to imitation learning that allows the RL agent to mimic
the grounded observations. Our method, RLZero, is the first to our knowledge to
show zero-shot language to behavior generation abilities without any
supervision on a variety of tasks on simulated domains. We further show that
RLZero can also generate policies zero-shot from cross-embodied videos such as
those scraped from YouTube.
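The imagine-project-imitate pipeline described above can be sketched in simplified form. This is a hypothetical illustration, not the paper's implementation: `imagine` stands in for a pretrained video-language model, `project` grounds imagined frames to an unsupervised agent's real observations via nearest-neighbor matching, and `imitate` uses a simple closed-form score over precomputed successor features in place of RLZero's actual imitation-learning solution. All function names, dimensions, and the scoring rule are assumptions for exposition.

```python
import numpy as np

def imagine(instruction, n_frames=8, dim=16):
    # Stand-in for a video-language model: returns imagined observation
    # embeddings for the instruction. (Hypothetical: a real system would
    # query a pretrained text-to-video model here.)
    rng = np.random.default_rng(abs(hash(instruction)) % 2**32)
    return rng.standard_normal((n_frames, dim))

def project(imagined, real_obs):
    # Ground each imagined frame to its nearest real observation
    # collected by an unsupervised RL agent (cosine similarity).
    a = imagined / np.linalg.norm(imagined, axis=1, keepdims=True)
    b = real_obs / np.linalg.norm(real_obs, axis=1, keepdims=True)
    idx = (a @ b.T).argmax(axis=1)
    return real_obs[idx]

def imitate(grounded, successor_features):
    # Closed-form imitation step (simplified): score each pretrained
    # policy by how well its successor features align with the mean of
    # the grounded observations, and return the best policy's index.
    target = grounded.mean(axis=0)
    scores = successor_features @ target
    return int(scores.argmax())
```

A usage sketch: `imitate(project(imagine("walk forward"), obs_buffer), psi)` selects, from a bank of unsupervised policies with successor features `psi`, the one whose expected feature visitation best matches the grounded imagined trajectory; no reward function or labeled data enters the loop.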