RL Zero: Zero-Shot Language to Behaviors without any Supervision
December 7, 2024
Authors: Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum
cs.AI
Abstract
Rewards remain an uninterpretable way to specify tasks for Reinforcement
Learning, as humans are often unable to predict the optimal behavior of any
given reward function, leading to poor reward design and reward hacking.
Language presents an appealing way to communicate intent to agents and bypass
reward design, but prior efforts to do so have been limited by costly and
unscalable labeling efforts. In this work, we propose a completely
unsupervised method for grounding language instructions to policies in a
zero-shot manner. We present a solution that takes the form of
imagine, project, and imitate: The agent imagines the observation sequence
corresponding to the language description of a task, projects the imagined
sequence onto the target domain, and grounds it to a policy. Video-language
models allow us to imagine observations for a task description, leveraging
knowledge of tasks learned from internet-scale video-text mappings. The
challenge remains to
ground these generations to a policy. In this work, we show that we can achieve
a zero-shot language-to-behavior policy by first grounding the imagined
sequences in real observations of an unsupervised RL agent and using a
closed-form solution to imitation learning that allows the RL agent to mimic
the grounded observations. Our method, RLZero, is, to our knowledge, the first
to show zero-shot language-to-behavior generation abilities without any
supervision on a variety of tasks in simulated domains. We further show that
RLZero can also generate policies zero-shot from cross-embodied videos such as
those scraped from YouTube.
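
The imagine-project-imitate pipeline described in the abstract can be summarized in pseudocode. The sketch below is illustrative only and is not the authors' implementation: `vlm.generate_frames`, `embed`, `backward_embed`, and `policy` are hypothetical stand-ins, and the closed-form imitation step (averaging a learned backward embedding over the grounded observations to obtain a policy-conditioning vector z) is one plausible reading of the "closed-form solution to imitation learning" mentioned above, in the style of forward-backward successor representations.

```python
# Illustrative sketch only -- not the paper's implementation. Assumes:
# (1) a hypothetical video-language model exposing vlm.generate_frames(prompt),
#     returning imagined observation frames for a language task description,
# (2) an unsupervised RL agent pre-trained with a backward embedding B(obs)
#     and a policy pi(obs, z) conditioned on a task vector z,
# (3) a buffer `real_obs` of real observations collected without supervision.

import numpy as np

def project_to_real(imagined_frames, real_obs, embed):
    """Project: ground each imagined frame to its nearest real observation
    in a shared embedding space (e.g., a frozen visual encoder)."""
    real_emb = embed(real_obs)                          # (N, d)
    grounded = []
    for frame in imagined_frames:
        sims = real_emb @ embed(frame[None])[0]         # similarity scores
        grounded.append(real_obs[np.argmax(sims)])      # closest real obs
    return np.stack(grounded)

def closed_form_task_vector(grounded_obs, backward_embed):
    """Imitate (closed form, assumed): average the backward embedding over
    the grounded observations to get the policy-conditioning vector z."""
    return backward_embed(grounded_obs).mean(axis=0)

def rl_zero(prompt, vlm, real_obs, embed, backward_embed, policy):
    """Language to behavior with no task-specific supervision."""
    imagined = vlm.generate_frames(prompt)                  # imagine
    grounded = project_to_real(imagined, real_obs, embed)   # project
    z = closed_form_task_vector(grounded, backward_embed)   # imitate
    return lambda obs: policy(obs, z)                       # zero-shot policy
```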