Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
September 24, 2024
Authors: Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani
cs.AI
Abstract
How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection, which is expensive, we show how we can leverage video generation models trained on easily available web data to enable generalization. Our approach, Gen2Act, casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act does not require fine-tuning the video model at all; we directly use a pre-trained model to generate human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/
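
The abstract describes a two-stage pipeline: a frozen, web-trained video generation model produces a human video of the task zero-shot, and a single video-conditioned policy, trained on far less robot interaction data, executes it in a closed loop. The sketch below illustrates only that inference structure; every class, function, shape, and parameter is a hypothetical placeholder and not the authors' actual code or model interface.

```python
# Minimal sketch of the two-stage inference pipeline described above.
# All names, shapes, and interfaces are hypothetical placeholders, not the
# authors' implementation: a frozen, web-trained generator produces a human
# video for the task, and a single video-conditioned policy executes it.

from dataclasses import dataclass
import numpy as np


@dataclass
class GeneratedVideo:
    frames: np.ndarray  # (T, H, W, 3) predicted human-video frames


class FrozenHumanVideoGenerator:
    """Stand-in for the pre-trained video model, used zero-shot (no fine-tuning)."""

    def generate(self, scene_image: np.ndarray, task_text: str) -> GeneratedVideo:
        # Placeholder: a real model would synthesize a human performing the task
        # described by `task_text` in the scene shown in `scene_image`.
        num_frames = 16
        return GeneratedVideo(frames=np.repeat(scene_image[None], num_frames, axis=0))


class VideoConditionedPolicy:
    """Stand-in for the single policy trained on comparatively little robot data."""

    def act(self, video: GeneratedVideo, observation: np.ndarray) -> np.ndarray:
        # Placeholder: a real policy would condition on features of the generated
        # video together with the live robot observation.
        return np.zeros(7)  # e.g., 6-DoF end-effector delta + gripper command


def run_episode(task_text, get_observation, send_action, max_steps=50):
    """Generate a human video once per task, then roll out the conditioned policy."""
    generator = FrozenHumanVideoGenerator()
    policy = VideoConditionedPolicy()

    video = generator.generate(get_observation(), task_text)  # zero-shot generation
    for _ in range(max_steps):
        action = policy.act(video, get_observation())
        send_action(action)
```

Note that the video is generated once per task and held fixed while the policy runs, which mirrors the paper's framing of generation as an up-front, zero-shot step separate from control.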