Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
September 24, 2024
Authors: Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani
cs.AI
Abstract
How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection, which is expensive, we show how we can leverage video generation models trained on easily available web data to enable generalization. Our approach, Gen2Act, casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act does not require fine-tuning the video model at all; we directly use a pre-trained model to generate human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/
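
The abstract describes a two-stage pipeline: a frozen, web-trained video generation model produces a human video of the task zero-shot, and a single video-conditioned policy, trained on far less robot interaction data, executes it in a closed loop. The sketch below illustrates only that inference structure; every class, function, shape, and parameter is a hypothetical placeholder and not the authors' actual code or model interface.

```python
# Minimal sketch of the two-stage inference pipeline described above.
# All names, shapes, and interfaces are hypothetical placeholders, not the
# authors' implementation: a frozen, web-trained generator produces a human
# video for the task, and a single video-conditioned policy executes it.

from dataclasses import dataclass
import numpy as np


@dataclass
class GeneratedVideo:
    frames: np.ndarray  # (T, H, W, 3) predicted human-video frames


class FrozenHumanVideoGenerator:
    """Stand-in for the pre-trained video model, used zero-shot (no fine-tuning)."""

    def generate(self, scene_image: np.ndarray, task_text: str) -> GeneratedVideo:
        # Placeholder: a real model would synthesize a human performing the task
        # described by `task_text` in the scene shown in `scene_image`.
        num_frames = 16
        return GeneratedVideo(frames=np.repeat(scene_image[None], num_frames, axis=0))


class VideoConditionedPolicy:
    """Stand-in for the single policy trained on comparatively little robot data."""

    def act(self, video: GeneratedVideo, observation: np.ndarray) -> np.ndarray:
        # Placeholder: a real policy would condition on features of the generated
        # video together with the live robot observation.
        return np.zeros(7)  # e.g., 6-DoF end-effector delta + gripper command


def run_episode(task_text, get_observation, send_action, max_steps=50):
    """Generate a human video once per task, then roll out the conditioned policy."""
    generator = FrozenHumanVideoGenerator()
    policy = VideoConditionedPolicy()

    video = generator.generate(get_observation(), task_text)  # zero-shot generation
    for _ in range(max_steps):
        action = policy.act(video, get_observation())
        send_action(action)
```

Note that the video is generated once per task and held fixed while the policy runs, which mirrors the paper's framing of generation as an up-front, zero-shot step separate from control.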