Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
December 5, 2024
Authors: Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu
cs.AI
Abstract
Recent developments in Large Language Models pre-trained on extensive corpora
have shown significant success in various natural language processing tasks
with minimal fine-tuning. This success offers new promise for robotics, which
has long been constrained by the high cost of action-labeled data. We ask:
given the abundant video data containing interaction-related knowledge
available as a rich "corpus", can a similar generative pre-training approach be
effectively applied to enhance robot learning? The key challenge is to identify
an effective representation for autoregressive pre-training that benefits robot
manipulation tasks. Inspired by the way humans learn new skills through
observing dynamic environments, we propose that effective robotic learning
should emphasize motion-related knowledge, which is closely tied to low-level
actions and is hardware-agnostic, facilitating the transfer of learned motions
to actual robot actions. To this end, we introduce Moto, which converts video
content into latent Motion Token sequences by a Latent Motion Tokenizer,
learning a bridging "language" of motion from videos in an unsupervised manner.
We pre-train Moto-GPT through motion token autoregression, enabling it to
capture diverse visual motion knowledge. After pre-training, Moto-GPT
demonstrates the promising ability to produce semantically interpretable motion
tokens, predict plausible motion trajectories, and assess trajectory
rationality through output likelihood. To transfer learned motion priors to
real robot actions, we implement a co-fine-tuning strategy that seamlessly
bridges latent motion token prediction and real robot control. Extensive
experiments show that the fine-tuned Moto-GPT exhibits superior robustness and
efficiency on robot manipulation benchmarks, underscoring its effectiveness in
transferring knowledge from video data to downstream visual manipulation tasks.
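
The abstract does not detail the Latent Motion Tokenizer's architecture, but a common way to realize an unsupervised discrete motion tokenizer is a VQ-VAE-style model that encodes a pair of consecutive frames into a few codebook indices. The sketch below assumes that design: the class name `LatentMotionTokenizer`, the codebook size, the number of tokens per frame transition, and the convolutional encoder are all illustrative choices, and the frame-reconstruction decoder that would drive the unsupervised training signal is omitted for brevity.

```python
import torch
import torch.nn as nn

class LatentMotionTokenizer(nn.Module):
    """VQ-style sketch: encode the motion between two consecutive RGB
    frames into a short sequence of discrete codebook indices."""

    def __init__(self, codebook_size=128, dim=64, num_tokens=8):
        super().__init__()
        self.num_tokens = num_tokens
        # The encoder sees both frames stacked along the channel axis,
        # so it can only represent what *changes* between them.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_tokens * dim),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, frame_t, frame_t1):
        b = frame_t.shape[0]
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=1))
        z = z.view(b, self.num_tokens, -1)                     # (B, N, D)
        # Vector quantization: nearest codebook entry per latent vector.
        dists = torch.cdist(z.reshape(-1, z.shape[-1]), self.codebook.weight)
        indices = dists.argmin(dim=-1).view(b, self.num_tokens)
        quantized = self.codebook(indices)
        # Straight-through estimator so gradients reach the encoder.
        quantized = z + (quantized - z).detach()
        return indices, quantized

# Discretize the motion between two (here random) 64x64 frames.
tok = LatentMotionTokenizer()
f0, f1 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
indices, _ = tok(f0, f1)    # indices: (2, 8) latent motion tokens
```

Because the tokens describe only the change between frames, they are hardware-agnostic in the sense the abstract emphasizes: the same token vocabulary can describe motion in human videos and robot videos alike.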
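Pre-training by motion token autoregression is standard next-token prediction over the tokenizer's output sequences. The minimal sketch below assumes a small decoder-only transformer built from PyTorch's `TransformerEncoder` with a causal mask; the class name `MotoGPTSketch`, the hyperparameters, and the `pretrain_step` helper are hypothetical, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotoGPTSketch(nn.Module):
    """Decoder-only transformer over latent motion tokens (illustrative)."""

    def __init__(self, vocab_size=128, dim=256, depth=4, heads=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)

    def features(self, tokens):
        t = tokens.shape[1]
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position may only attend to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        return self.blocks(x, mask=mask)

    def forward(self, tokens):
        return self.head(self.features(tokens))

def pretrain_step(model, tokens, optimizer):
    """Motion-token autoregression: predict token t+1 from tokens <= t."""
    logits = model(tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the model is trained with token-level cross-entropy, summing the per-token log-probabilities of a candidate token sequence yields exactly the kind of output likelihood the abstract describes for assessing trajectory rationality.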
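The co-fine-tuning strategy is described only at a high level. One plausible reading is a joint objective that keeps the motion-token prediction loss on action-free video while adding an action-prediction loss on action-labeled robot data, so the learned motion prior is preserved as real control is learned. The sketch below reuses `MotoGPTSketch` from the previous block; the `action_head`, the 7-DoF action dimension, the MSE loss, and the weight `w_act` are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_finetune_step(model, action_head, video_tokens,
                     robot_tokens, robot_actions, optimizer, w_act=1.0):
    """One joint step: keep the motion prior on action-free video while
    learning real control from action-labeled robot data (illustrative)."""
    # (1) Motion-token autoregression on video-only data.
    logits = model(video_tokens[:, :-1])
    loss_motion = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                  video_tokens[:, 1:].reshape(-1))
    # (2) Continuous action regression on robot data: decode per-step
    #     trunk features into low-level actions with a small head.
    hidden = model.features(robot_tokens)              # (B, T, dim)
    loss_action = F.mse_loss(action_head(hidden), robot_actions)
    loss = loss_motion + w_act * loss_action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_motion.item(), loss_action.item()

# Hypothetical shapes: 7-DoF end-effector actions per token step.
model = MotoGPTSketch()
action_head = nn.Linear(256, 7)
opt = torch.optim.AdamW(
    list(model.parameters()) + list(action_head.parameters()), lr=1e-4)
video_tokens = torch.randint(0, 128, (4, 32))
robot_tokens = torch.randint(0, 128, (4, 32))
robot_actions = torch.randn(4, 32, 7)
co_finetune_step(model, action_head, video_tokens,
                 robot_tokens, robot_actions, opt)
```

Keeping the motion-token loss in the mix during fine-tuning is what "seamlessly bridges" the two regimes in this reading: the trunk is never trained on robot actions alone, which would risk overwriting the video-derived motion prior.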