Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
December 6, 2024
Authors: Ran Tian, Yilin Wu, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy
cs.AI
Abstract
Visuomotor robot policies, increasingly pre-trained on large-scale datasets,
promise significant advancements across robotics domains. However, aligning
these policies with end-user preferences remains a challenge, particularly when
the preferences are hard to specify. While reinforcement learning from human
feedback (RLHF) has become the predominant mechanism for alignment in
non-embodied domains like large language models, it has not seen the same
success in aligning visuomotor policies due to the prohibitive amount of human
feedback required to learn visual reward functions. To address this limitation,
we propose Representation-Aligned Preference-based Learning (RAPL), an
observation-only method for learning visual rewards from significantly less
human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback
on fine-tuning pre-trained vision encoders to align with the end-user's visual
representation and then constructs a dense visual reward via feature matching
in this aligned representation space. We first validate RAPL through simulation
experiments in the X-Magical benchmark and Franka Panda robotic manipulation,
demonstrating that it learns rewards aligned with human preferences, uses
preference data more efficiently, and generalizes across robot embodiments.
Finally, our hardware experiments align pre-trained Diffusion Policies for
three object manipulation tasks. We find that RAPL can fine-tune these policies
with 5x less real human preference data, taking the first step towards
minimizing human feedback while maximizing visuomotor robot policy alignment.
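
To make the idea of a dense reward built by feature matching more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not say how features are matched, so nearest-neighbor cosine similarity is an assumption made for illustration. The names `encode` and `feature_matching_reward` are hypothetical, and the "encoder" is a random linear projection standing in for a fine-tuned vision encoder.

```python
import numpy as np


def encode(frames, weights):
    """Hypothetical stand-in for an aligned vision encoder.

    Flattens each frame, applies a linear map, and L2-normalizes the result.
    A real system would use a fine-tuned pre-trained vision model instead.
    """
    flat = frames.reshape(frames.shape[0], -1)
    feats = flat @ weights
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-8)


def feature_matching_reward(rollout_feats, reference_feats):
    """Per-frame reward: cosine similarity to the closest reference frame, minus 1.

    Assumption for illustration only; the matching rule is not given in the
    abstract. Rewards lie in [-2, 0], and 0 means a rollout frame's features
    coincide with some reference frame's features.
    """
    similarity = rollout_feats @ reference_feats.T  # shape (T_rollout, T_reference)
    return similarity.max(axis=1) - 1.0


# Toy data: 8x8 RGB "frames" from a preferred reference video and a robot rollout.
rng = np.random.default_rng(0)
weights = rng.normal(size=(8 * 8 * 3, 16))
reference = rng.random((5, 8, 8, 3))
rollout = rng.random((7, 8, 8, 3))

rewards = feature_matching_reward(encode(rollout, weights), encode(reference, weights))
print(rewards.shape)  # one reward per rollout frame
```

Because every observation frame receives its own score, the reward is dense rather than a single label per trajectory. In this sketch the quality of the reward depends entirely on the encoder, which is the component the abstract says human feedback is spent on.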