IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI
October 17, 2024
Authors: Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, Jiang Bian
cs.AI
Abstract
We introduce Image-GOal Representations (IGOR), aiming to learn a unified, semantically consistent action space across humans and various robots. Through this unified latent action space, IGOR enables knowledge transfer among large-scale robot and human activity data. We achieve this by compressing the visual changes between an initial image and its goal state into latent actions. IGOR allows us to generate latent action labels for internet-scale video data. This unified latent action space enables the training of foundation policy and world models across a wide variety of tasks performed by both robots and humans. We demonstrate that: (1) IGOR learns a semantically consistent action space for both humans and robots, characterizing various possible object motions that represent physical interaction knowledge; (2) IGOR can "migrate" the movements of an object in one video to other videos, even across humans and robots, by jointly using the latent action model and the world model; (3) IGOR can learn to align latent actions with natural language through the foundation policy model, and integrate latent actions with a low-level policy model to achieve effective robot control. We believe IGOR opens new possibilities for human-to-robot knowledge transfer and control.
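The core mechanism described above is to compress the visual change between an initial frame and a goal frame into a discrete latent action, and to train a world model that can replay that action on a new frame. Below is a minimal PyTorch sketch of this idea; the module names (`LatentActionEncoder`, `WorldModel`), the vector-quantization scheme, and all dimensions are illustrative assumptions, not IGOR's actual architecture.

```python
# Minimal sketch of a latent action model + world model, assuming a simple
# VQ-style bottleneck. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionEncoder(nn.Module):
    """Compresses the visual change between an initial frame and a goal frame
    into a discrete latent action via nearest-neighbor vector quantization."""
    def __init__(self, codebook_size=64, dim=32):
        super().__init__()
        self.conv = nn.Sequential(                 # encode the stacked frame pair
            nn.Conv2d(6, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, dim)
        self.codebook = nn.Embedding(codebook_size, dim)  # shared action vocabulary

    def forward(self, initial, goal):
        z = self.proj(self.conv(torch.cat([initial, goal], dim=1)).flatten(1))
        dists = torch.cdist(z, self.codebook.weight)   # distance to each code
        idx = dists.argmin(dim=1)                      # the "latent action label"
        z_q = self.codebook(idx)
        # straight-through estimator: gradients flow to the encoder only
        # (a full VQ-VAE would add codebook/commitment losses; omitted here)
        z_q = z + (z_q - z).detach()
        return z_q, idx

class WorldModel(nn.Module):
    """Predicts the goal frame from the initial frame and a latent action."""
    def __init__(self, dim=32):
        super().__init__()
        self.film = nn.Linear(dim, 16)             # condition features on the action
        self.enc = nn.Conv2d(3, 16, 3, padding=1)
        self.dec = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, initial, action):
        h = F.relu(self.enc(initial))
        h = h + self.film(action)[:, :, None, None]
        return self.dec(h)

# Training signal: reconstruct the goal frame, forcing the bottleneck to
# encode only the change between the two frames.
enc, wm = LatentActionEncoder(), WorldModel()
initial, goal = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
action, idx = enc(initial, goal)
loss = F.mse_loss(wm(initial, action), goal)
loss.backward()

# "Migration" (point 2): extract the latent action from one video transition
# and replay it on a frame from a different video via the world model.
src_prev, src_next = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
tgt_frame = torch.randn(1, 3, 64, 64)
act, _ = enc(src_prev, src_next)
migrated = wm(tgt_frame, act)                   # same motion, new scene
```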
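Point (3) implies a two-level control stack: a foundation policy that maps an observation and a language instruction to a latent action, and a low-level policy that decodes that latent action into robot commands. The sketch below illustrates this split under the same assumptions as above; `FoundationPolicy`, `LowLevelPolicy`, the discrete action vocabulary, and every dimension (e.g., 7-DoF control output) are hypothetical.

```python
# Minimal sketch of the policy hierarchy, reusing the codebook idea from the
# previous sketch. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FoundationPolicy(nn.Module):
    """Maps observation features plus a language embedding to a distribution
    over the latent-action vocabulary (aligning language with latent actions)."""
    def __init__(self, obs_dim=128, text_dim=64, codebook_size=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 128), nn.ReLU(),
            nn.Linear(128, codebook_size),      # logits over latent actions
        )

    def forward(self, obs_feat, text_feat):
        return self.head(torch.cat([obs_feat, text_feat], dim=-1))

class LowLevelPolicy(nn.Module):
    """Decodes a chosen latent action, given the current observation, into
    continuous robot commands (e.g., end-effector deltas; 7-DoF assumed)."""
    def __init__(self, obs_dim=128, action_dim=32, ctrl_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, ctrl_dim),
        )

    def forward(self, obs_feat, latent_action):
        return self.net(torch.cat([obs_feat, latent_action], dim=-1))

codebook = nn.Embedding(64, 32)                 # shared with the latent action model
high, low = FoundationPolicy(), LowLevelPolicy()
obs, text = torch.randn(1, 128), torch.randn(1, 64)
idx = high(obs, text).argmax(dim=-1)            # pick a latent action from language
ctrl = low(obs, codebook(idx))                  # realize it as robot commands
print(idx.item(), ctrl.shape)                   # torch.Size([1, 7])
```

The design point this illustrates is that the latent action acts as the interface between the two levels: the foundation policy can be trained on human and robot video alike, while only the low-level decoder needs embodiment-specific robot data.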