Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Abstract

How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution that predicts motion information from web data through human video generation and conditions a robot policy on the generated video. Instead of attempting to scale robot data collection, which is expensive, we show how to leverage video generation models trained on easily available web data to enable generalization. Our approach, Gen2Act, casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data than the video prediction model was trained on. Gen2Act does not require fine-tuning the video model at all; we directly use a pre-trained model to generate human videos. Our results on diverse real-world scenarios show that Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/
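
The two-stage structure described in the abstract (generate a human video once, then run a single policy conditioned on it) can be illustrated with a minimal control-flow sketch. This is not the authors' implementation. The function name `run_two_stage_episode`, the type aliases, the use of an initial scene image as generator input, the fixed step budget, and all three stub components are assumptions introduced only for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical, deliberately tiny data types for the sketch.
Frame = List[List[float]]   # a small grayscale image
Video = List[Frame]         # a sequence of frames
Action = List[float]        # a low-dimensional robot command


def run_two_stage_episode(
    task: str,
    first_frame: Frame,
    generate_human_video: Callable[[Frame, str], Video],
    policy: Callable[[Video, Sequence[Frame]], Action],
    step_env: Callable[[Action], Frame],
    max_steps: int = 10,
) -> List[Action]:
    """Generate a human video once, then act conditioned on it."""
    # Stage 1: query a frozen, pre-trained video model (no fine-tuning).
    # Conditioning on the scene image is an assumption of this sketch.
    human_video = generate_human_video(first_frame, task)

    # Stage 2: one policy, conditioned on the generated video and on the
    # robot observations seen so far (the history is also an assumption).
    history: List[Frame] = [first_frame]
    actions: List[Action] = []
    for _ in range(max_steps):
        action = policy(human_video, history)
        actions.append(action)
        history.append(step_env(action))
    return actions


# Stub components, purely illustrative stand-ins for real models.
def stub_generator(frame: Frame, task: str) -> Video:
    # Repeats the input frame four times instead of generating motion.
    return [frame for _ in range(4)]


def stub_policy(video: Video, history: Sequence[Frame]) -> Action:
    # Emits progress through the video as a one-dimensional action.
    return [len(history) / len(video)]


def stub_env(action: Action) -> Frame:
    # Returns a 1x1 "image" that encodes the last action.
    return [[action[0]]]


actions = run_two_stage_episode(
    "open the drawer", [[0.0]], stub_generator, stub_policy, stub_env, max_steps=3
)
print(actions)
```

The point of the sketch is the separation of concerns: the video generator is called once and never updated, while the policy is the only component that would be trained on robot interaction data.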
