
Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment

December 6, 2024
Authors: Ran Tian, Yilin Wu, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy
cs.AI

Abstract

Visuomotor robot policies, increasingly pre-trained on large-scale datasets, promise significant advancements across robotics domains. However, aligning these policies with end-user preferences remains a challenge, particularly when the preferences are hard to specify. While reinforcement learning from human feedback (RLHF) has become the predominant mechanism for alignment in non-embodied domains like large language models, it has not seen the same success in aligning visuomotor policies due to the prohibitive amount of human feedback required to learn visual reward functions. To address this limitation, we propose Representation-Aligned Preference-based Learning (RAPL), an observation-only method for learning visual rewards from significantly less human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback on fine-tuning pre-trained vision encoders to align with the end-user's visual representation and then constructs a dense visual reward via feature matching in this aligned representation space. We first validate RAPL through simulation experiments in the X-Magical benchmark and Franka Panda robotic manipulation, demonstrating that it learns rewards aligned with human preferences, uses preference data more efficiently, and generalizes across robot embodiments. Finally, our hardware experiments align pre-trained Diffusion Policies for three object manipulation tasks. We find that RAPL can fine-tune these policies with 5x less real human preference data, taking the first step towards minimizing human feedback while maximizing visuomotor robot policy alignment.
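
The key mechanism described above, a dense visual reward obtained by matching features of the robot's observations against preferred behavior inside the aligned representation space, can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the paper's exact implementation: the `encoder` callable, the per-frame observation tensors, the cosine cost, and the entropic optimal-transport (Sinkhorn) matching are stand-ins chosen for concreteness.

```python
# Minimal sketch (assumptions noted above): dense reward from feature matching
# between rollout frames and preferred-behavior frames in an aligned encoder space.
import torch


def feature_matching_reward(rollout_obs, preferred_obs, encoder, eps=0.05, iters=50):
    """Return a per-timestep reward: negative matched feature cost between the
    rollout and a preferred (e.g., expert) observation sequence."""
    with torch.no_grad():
        z_r = encoder(rollout_obs)    # (T, d) features of the rollout frames
        z_p = encoder(preferred_obs)  # (N, d) features of preferred frames

    # Pairwise cosine cost between rollout and preferred features.
    z_r = torch.nn.functional.normalize(z_r, dim=-1)
    z_p = torch.nn.functional.normalize(z_p, dim=-1)
    cost = 1.0 - z_r @ z_p.T                              # (T, N)

    # Entropic-regularized optimal transport (Sinkhorn) with uniform marginals.
    T, N = cost.shape
    a = torch.full((T,), 1.0 / T, device=cost.device)
    b = torch.full((N,), 1.0 / N, device=cost.device)
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = torch.diag(u) @ K @ torch.diag(v)              # (T, N) transport plan

    # Dense reward: negative transported cost attributed to each rollout frame.
    return -(plan * cost).sum(dim=1)                      # (T,)
```

In this sketch, fine-tuning the encoder with human preference feedback changes the feature space in which `cost` is computed, so the same matching procedure yields rewards that reflect the end-user's notion of preferred behavior.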
