Learning from Massive Human Videos for Universal Humanoid Pose Control
December 18, 2024
Authors: Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, Yue Wang
cs.AI
Abstract
Scalable learning of humanoid robots is crucial for their deployment in
real-world applications. While traditional approaches primarily rely on
reinforcement learning or teleoperation to achieve whole-body control, they are
often limited by the diversity of simulated environments and the high costs of
demonstration collection. In contrast, human videos are ubiquitous and present
an untapped source of semantic and motion information that could significantly
enhance the generalization capabilities of humanoid robots. This paper
introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot
poses with corresponding text-based motion descriptions, designed to leverage
this abundant data. Humanoid-X is curated through a comprehensive pipeline:
data mining from the Internet, video caption generation, motion retargeting of
humans to humanoid robots, and policy learning for real-world deployment. With
Humanoid-X, we further train a large humanoid model, UH-1, which takes text
instructions as input and outputs corresponding actions to control a humanoid
robot. Extensive simulated and real-world experiments validate that our
scalable training approach leads to superior generalization in text-based
humanoid control, marking a significant step toward adaptable, real-world-ready
humanoid robots.
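
For concreteness, the four curation stages the abstract lists (data mining, caption generation, motion retargeting, policy learning) can be pictured as the minimal Python sketch below. Every function, field, and dimension here is a hypothetical placeholder invented for illustration; the paper does not expose this API:

```python
# Hypothetical sketch of the four-stage Humanoid-X curation pipeline named in
# the abstract. All functions below are placeholder stubs, not released code.

from dataclasses import dataclass, field

@dataclass
class MotionSample:
    caption: str                                      # text motion description
    robot_poses: list = field(default_factory=list)   # retargeted humanoid poses
    actions: list = field(default_factory=list)       # deployable control actions

def mine_video(url):
    """Stage 1: data mining from the Internet (stub)."""
    return {"url": url, "frames": []}

def generate_caption(clip):
    """Stage 2: video caption generation (stub)."""
    return "a person waves with the right hand"

def retarget_to_humanoid(clip):
    """Stage 3: retarget estimated human motion to humanoid joints (stub).
    The 23 degrees of freedom is an assumption, not a figure from the paper."""
    return [[0.0] * 23]

def learn_policy(poses):
    """Stage 4: policy learning so the poses run on real hardware (stub)."""
    return [[0.0] * 23 for _ in poses]

def curate(urls):
    dataset = []
    for url in urls:
        clip = mine_video(url)
        sample = MotionSample(caption=generate_caption(clip))
        sample.robot_poses = retarget_to_humanoid(clip)
        sample.actions = learn_policy(sample.robot_poses)
        dataset.append(sample)
    return dataset

print(len(curate(["https://example.com/clip.mp4"])))  # -> 1
```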
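Similarly, the text-to-action behavior the abstract attributes to UH-1 amounts to an interface like the one below. The class name, method, action dimension, and horizon are all assumptions made for illustration, not the released model API:

```python
# Hedged sketch of a text-conditioned humanoid control interface: a text
# instruction in, a sequence of control actions out. All names and shapes
# are placeholders; this is not the UH-1 implementation.

import numpy as np

class UH1Stub:
    """Stand-in for a text-conditioned humanoid motion model."""

    def __init__(self, action_dim: int = 23, horizon: int = 50):
        self.action_dim = action_dim   # assumed joint-space action size
        self.horizon = horizon         # number of control steps to emit

    def generate_actions(self, instruction: str) -> np.ndarray:
        # A real model would encode the instruction and decode motion tokens;
        # this stub just returns a zero trajectory of the right shape.
        assert instruction, "an instruction string is required"
        return np.zeros((self.horizon, self.action_dim))

model = UH1Stub()
actions = model.generate_actions("wave with the right hand")
print(actions.shape)  # (50, 23): one action vector per control step
```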