GRAPE: Generalizing Robot Policy via Preference Alignment

November 28, 2024
Authors: Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Chaoqi Wang, Mingyu Ding, Dieter Fox, Huaxiu Yao
cs.AI

Abstract

Despite recent advances of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, due to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, thus introducing distribution bias and limiting their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks into independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 44.31% and rollout step-length by 11.15%, respectively. All code, models, and data are available at https://grape-vla.github.io/
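The abstract describes aligning a VLA policy at the trajectory level using preference pairs of successful and failed rollouts, with the reward modeled implicitly rather than learned explicitly. A common way to realize such implicit reward modeling is a DPO-style objective applied to whole-trajectory log-probabilities. The sketch below is illustrative only and not the paper's actual training objective: the function name, inputs, and the fixed `beta` temperature are assumptions, and each `logp` argument stands for the summed log-probability a policy (or a frozen reference copy) assigns to an entire action trajectory.

```python
import math

def trajectory_dpo_loss(logp_chosen, logp_rejected,
                        ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Illustrative DPO-style preference loss at the trajectory level.

    The implicit reward of a trajectory is its log-probability ratio
    against a frozen reference policy, scaled by beta. The loss is the
    negative log-sigmoid of the reward margin between the preferred
    (e.g. successful) and rejected (e.g. failed) trajectories.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # trajectory more strongly than the reference does, large otherwise.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With zero margin the loss is log 2; it shrinks as the policy assigns relatively more probability to the preferred trajectory, which is the sense in which the reward is "implicitly" modeled from success/failure pairs without a separate reward network.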

