Learning from Massive Human Videos for Universal Humanoid Pose Control

Abstract

Scalable learning for humanoid robots is crucial for their deployment in real-world applications. Traditional approaches rely primarily on reinforcement learning or teleoperation to achieve whole-body control, but they are often limited by the diversity of simulated environments and the high cost of collecting demonstrations. In contrast, human videos are ubiquitous and offer an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots. This paper introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot poses with corresponding text-based motion descriptions, designed to leverage this abundant data. Humanoid-X is curated through a comprehensive pipeline: data mining from the Internet, video caption generation, motion retargeting from humans to humanoid robots, and policy learning for real-world deployment. With Humanoid-X, we further train a large humanoid model, UH-1, which takes text instructions as input and outputs corresponding actions to control a humanoid robot. Extensive simulated and real-world experiments validate that our scalable training approach leads to superior generalization in text-based humanoid control, marking a significant step toward adaptable, real-world-ready humanoid robots.
