Learning from Massive Human Videos for Universal Humanoid Pose Control
December 18, 2024
Authors: Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, Yue Wang
cs.AI
Abstract
Scalable learning of humanoid robots is crucial for their deployment in real-world applications. While traditional approaches primarily rely on reinforcement learning or teleoperation to achieve whole-body control, they are often limited by the diversity of simulated environments and the high costs of demonstration collection. In contrast, human videos are ubiquitous and present an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots. This paper introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot poses with corresponding text-based motion descriptions, designed to leverage this abundant data. Humanoid-X is curated through a comprehensive pipeline: data mining from the Internet, video caption generation, motion retargeting of humans to humanoid robots, and policy learning for real-world deployment. With Humanoid-X, we further train a large humanoid model, UH-1, which takes text instructions as input and outputs corresponding actions to control a humanoid robot. Extensive simulated and real-world experiments validate that our scalable training approach leads to superior generalization in text-based humanoid control, marking a significant step toward adaptable, real-world-ready humanoid robots.
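
To make the described workflow concrete, below is a minimal Python sketch of the four-stage Humanoid-X pipeline and the UH-1 text-to-action interface as summarized in the abstract. All class and function names, joint counts, and trajectory shapes here are illustrative assumptions for exposition; they are not the authors' actual code or API.

```python
# Illustrative sketch only: stage functions and UH1Model are hypothetical
# stand-ins for the pipeline described in the abstract.
import numpy as np

def mine_clips(source: str) -> list[str]:
    """Stage 1: collect candidate human-motion video clips (stub)."""
    return [f"{source}/clip_{i}.mp4" for i in range(3)]

def caption_clip(clip: str) -> str:
    """Stage 2: generate a text description of the motion (stub)."""
    return f"a person performing the motion in {clip}"

def retarget_to_humanoid(clip: str, num_joints: int = 25) -> np.ndarray:
    """Stage 3: map estimated human poses onto humanoid joints (stub)."""
    return np.zeros((60, num_joints))  # (timesteps, joint targets)

class UH1Model:
    """Stand-in for the text-conditioned humanoid motion model (UH-1)."""
    def generate(self, instruction: str, horizon: int = 60,
                 action_dim: int = 25) -> np.ndarray:
        # A real model would encode the instruction and decode a motion;
        # here we return a small random trajectory of the right shape.
        rng = np.random.default_rng(abs(hash(instruction)) % 2**32)
        return 0.01 * rng.standard_normal((horizon, action_dim))

# Stage 4 (policy learning) would train on the (caption, pose) pairs;
# at deployment the trained model maps text directly to actions:
actions = UH1Model().generate("wave with the right hand")
print(actions.shape)  # (60, 25): one joint-target vector per control step
```

The key design point conveyed by the abstract is the interface of the final model: a free-form text instruction in, a sequence of humanoid control targets out, with all supervision derived from captioned, retargeted human video rather than teleoperated demonstrations.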