Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
December 5, 2024
Authors: Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu
cs.AI
Abstract
Recent developments in Large Language Models pre-trained on extensive corpora
have shown significant success in various natural language processing tasks
with minimal fine-tuning. This success offers new promise for robotics, which
has long been constrained by the high cost of action-labeled data. We ask:
given the abundant video data containing interaction-related knowledge
available as a rich "corpus", can a similar generative pre-training approach be
effectively applied to enhance robot learning? The key challenge is to identify
an effective representation for autoregressive pre-training that benefits robot
manipulation tasks. Inspired by the way humans learn new skills through
observing dynamic environments, we propose that effective robotic learning
should emphasize motion-related knowledge, which is closely tied to low-level
actions and is hardware-agnostic, facilitating the transfer of learned motions
to actual robot actions. To this end, we introduce Moto, which converts video
content into latent Motion Token sequences by a Latent Motion Tokenizer,
learning a bridging "language" of motion from videos in an unsupervised manner.
We pre-train Moto-GPT through motion token autoregression, enabling it to
capture diverse visual motion knowledge. After pre-training, Moto-GPT
demonstrates the promising ability to produce semantically interpretable motion
tokens, predict plausible motion trajectories, and assess trajectory
rationality through output likelihood. To transfer learned motion priors to
real robot actions, we implement a co-fine-tuning strategy that seamlessly
bridges latent motion token prediction and real robot control. Extensive
experiments show that the fine-tuned Moto-GPT exhibits superior robustness and
efficiency on robot manipulation benchmarks, underscoring its effectiveness in
transferring knowledge from video data to downstream visual manipulation tasks.
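
The abstract does not detail the Latent Motion Tokenizer's architecture, but a common way to realize an unsupervised discrete motion tokenizer is a VQ-VAE-style model that encodes a pair of consecutive frames into a few codebook indices. The sketch below assumes that design: the class name `LatentMotionTokenizer`, the codebook size, the number of tokens per frame transition, and the convolutional encoder are all illustrative choices, and the frame-reconstruction decoder that would drive the unsupervised training signal is omitted for brevity.

```python
import torch
import torch.nn as nn

class LatentMotionTokenizer(nn.Module):
    """VQ-style sketch: encode the motion between two consecutive RGB
    frames into a short sequence of discrete codebook indices."""

    def __init__(self, codebook_size=128, dim=64, num_tokens=8):
        super().__init__()
        self.num_tokens = num_tokens
        # The encoder sees both frames stacked along the channel axis,
        # so it can only represent what *changes* between them.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_tokens * dim),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, frame_t, frame_t1):
        b = frame_t.shape[0]
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=1))
        z = z.view(b, self.num_tokens, -1)                     # (B, N, D)
        # Vector quantization: nearest codebook entry per latent vector.
        dists = torch.cdist(z.reshape(-1, z.shape[-1]), self.codebook.weight)
        indices = dists.argmin(dim=-1).view(b, self.num_tokens)
        quantized = self.codebook(indices)
        # Straight-through estimator so gradients reach the encoder.
        quantized = z + (quantized - z).detach()
        return indices, quantized

# Discretize the motion between two (here random) 64x64 frames.
tok = LatentMotionTokenizer()
f0, f1 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
indices, _ = tok(f0, f1)    # indices: (2, 8) latent motion tokens
```

Because the tokens describe only the change between frames, they are hardware-agnostic in the sense the abstract emphasizes: the same token vocabulary can describe motion in human videos and robot videos alike.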
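Pre-training by motion token autoregression is standard next-token prediction over the tokenizer's output sequences. The minimal sketch below assumes a small decoder-only transformer built from PyTorch's `TransformerEncoder` with a causal mask; the class name `MotoGPTSketch`, the hyperparameters, and the `pretrain_step` helper are hypothetical, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotoGPTSketch(nn.Module):
    """Decoder-only transformer over latent motion tokens (illustrative)."""

    def __init__(self, vocab_size=128, dim=256, depth=4, heads=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)

    def features(self, tokens):
        t = tokens.shape[1]
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position may only attend to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        return self.blocks(x, mask=mask)

    def forward(self, tokens):
        return self.head(self.features(tokens))

def pretrain_step(model, tokens, optimizer):
    """Motion-token autoregression: predict token t+1 from tokens <= t."""
    logits = model(tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the model is trained with token-level cross-entropy, summing the per-token log-probabilities of a candidate token sequence yields exactly the kind of output likelihood the abstract describes for assessing trajectory rationality.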
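The co-fine-tuning strategy is described only at a high level. One plausible reading is a joint objective that keeps the motion-token prediction loss on action-free video while adding an action-prediction loss on action-labeled robot data, so the learned motion prior is preserved as real control is learned. The sketch below reuses `MotoGPTSketch` from the previous block; the `action_head`, the 7-DoF action dimension, the MSE loss, and the weight `w_act` are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_finetune_step(model, action_head, video_tokens,
                     robot_tokens, robot_actions, optimizer, w_act=1.0):
    """One joint step: keep the motion prior on action-free video while
    learning real control from action-labeled robot data (illustrative)."""
    # (1) Motion-token autoregression on video-only data.
    logits = model(video_tokens[:, :-1])
    loss_motion = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                  video_tokens[:, 1:].reshape(-1))
    # (2) Continuous action regression on robot data: decode per-step
    #     trunk features into low-level actions with a small head.
    hidden = model.features(robot_tokens)              # (B, T, dim)
    loss_action = F.mse_loss(action_head(hidden), robot_actions)
    loss = loss_motion + w_act * loss_action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_motion.item(), loss_action.item()

# Hypothetical shapes: 7-DoF end-effector actions per token step.
model = MotoGPTSketch()
action_head = nn.Linear(256, 7)
opt = torch.optim.AdamW(
    list(model.parameters()) + list(action_head.parameters()), lr=1e-4)
video_tokens = torch.randint(0, 128, (4, 32))
robot_tokens = torch.randint(0, 128, (4, 32))
robot_actions = torch.randn(4, 32, 7)
co_finetune_step(model, action_head, video_tokens,
                 robot_tokens, robot_actions, opt)
```

Keeping the motion-token loss in the mix during fine-tuning is what "seamlessly bridges" the two regimes in this reading: the trunk is never trained on robot actions alone, which would risk overwriting the video-derived motion prior.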