Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
December 5, 2024
Authors: Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu
cs.AI
Abstract
Recent developments in Large Language Models pre-trained on extensive corpora
have shown significant success in various natural language processing tasks
with minimal fine-tuning. This success offers new promise for robotics, which
has long been constrained by the high cost of action-labeled data. We ask:
given the abundant video data containing interaction-related knowledge
available as a rich "corpus", can a similar generative pre-training approach be
effectively applied to enhance robot learning? The key challenge is to identify
an effective representation for autoregressive pre-training that benefits robot
manipulation tasks. Inspired by the way humans learn new skills through
observing dynamic environments, we propose that effective robotic learning
should emphasize motion-related knowledge, which is closely tied to low-level
actions and is hardware-agnostic, facilitating the transfer of learned motions
to actual robot actions. To this end, we introduce Moto, which converts video
content into latent Motion Token sequences by a Latent Motion Tokenizer,
learning a bridging "language" of motion from videos in an unsupervised manner.
We pre-train Moto-GPT through motion token autoregression, enabling it to
capture diverse visual motion knowledge. After pre-training, Moto-GPT
demonstrates the promising ability to produce semantically interpretable motion
tokens, predict plausible motion trajectories, and assess trajectory
rationality through output likelihood. To transfer learned motion priors to
real robot actions, we implement a co-fine-tuning strategy that seamlessly
bridges latent motion token prediction and real robot control. Extensive
experiments show that the fine-tuned Moto-GPT exhibits superior robustness and
efficiency on robot manipulation benchmarks, underscoring its effectiveness in
transferring knowledge from video data to downstream visual manipulation tasks.
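The abstract describes the Latent Motion Tokenizer only at a high level, so the following is a minimal, hypothetical sketch of the idea: a VQ-style encoder that compresses the change between two consecutive video frames into a short sequence of discrete motion tokens. Every name, layer size, and the codebook formulation here is an illustrative assumption, not the paper's actual architecture.

```python
# Hypothetical sketch of a latent motion tokenizer (assumed design, not
# the paper's): two consecutive frames in, a few discrete motion tokens out.
import torch
import torch.nn as nn


class LatentMotionTokenizer(nn.Module):
    def __init__(self, codebook_size=128, dim=64, num_tokens=8):
        super().__init__()
        # The encoder sees the current and next frame stacked channel-wise
        # (3 + 3 = 6 channels) and summarizes what changed between them.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, num_tokens)),  # -> (B, dim, 1, num_tokens)
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, frame_t, frame_t1):
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=1))
        z = z.squeeze(2).transpose(1, 2)                  # (B, num_tokens, dim)
        # Nearest codebook entry per position = the discretization step.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        token_ids = dists.argmin(dim=-1)                  # discrete motion tokens
        quantized = self.codebook(token_ids)
        # Straight-through estimator so gradients reach the encoder; a
        # decoder reconstructing frame_t1 from frame_t plus `quantized`
        # (omitted here) would supply the unsupervised training signal.
        return token_ids, z + (quantized - z).detach()


tokenizer = LatentMotionTokenizer()
ids, _ = tokenizer(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
print(ids.shape)  # torch.Size([2, 8])
```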
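In the same hedged spirit, here is a rough sketch of the two training stages the abstract outlines: autoregressive pre-training on motion tokens from unlabeled video, followed by co-fine-tuning that jointly supervises motion-token prediction and real robot actions. The backbone, loss weighting, and 7-dimensional action head are assumptions for illustration, not the paper's exact recipe.

```python
# Assumed two-stage recipe: (1) next-motion-token pre-training,
# (2) co-fine-tuning with an added action head for robot control.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 128   # motion-token codebook size (assumed)
ACT_DIM = 7   # e.g. 6-DoF end-effector delta + gripper (assumed)


class MotoGPT(nn.Module):
    def __init__(self, dim=256, layers=4, heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.token_head = nn.Linear(dim, VOCAB)     # next motion token
        self.action_head = nn.Linear(dim, ACT_DIM)  # used in fine-tuning

    def forward(self, tokens):
        x = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        return self.token_head(h), self.action_head(h)


model = MotoGPT()

# Stage 1: pre-train on motion-token sequences extracted from video.
tokens = torch.randint(0, VOCAB, (4, 16))   # stand-in token sequences
logits, _ = model(tokens[:, :-1])
pretrain_loss = F.cross_entropy(logits.reshape(-1, VOCAB),
                                tokens[:, 1:].reshape(-1))
# The same next-token likelihood can score how plausible a candidate
# trajectory is, matching the abstract's rationality-assessment claim.

# Stage 2: co-fine-tune, keeping the token loss while also regressing
# real robot actions so the motion prior transfers to control.
actions = torch.randn(4, 15, ACT_DIM)       # stand-in action labels
logits, pred_actions = model(tokens[:, :-1])
cofinetune_loss = (
    F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    + F.mse_loss(pred_actions, actions)
)
print(pretrain_loss.item(), cofinetune_loss.item())
```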