Pre-training Auto-regressive Robotic Models with 4D Representations
February 18, 2025
Authors: Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, Trevor Darrell, Roei Herzig
cs.AI
Abstract
Foundation models pre-trained on massive unlabeled datasets have
revolutionized natural language processing and computer vision, exhibiting remarkable
generalization capabilities, thus highlighting the importance of pre-training.
Yet, efforts in robotics have struggled to achieve similar success, limited by
either the need for costly robotic annotations or the lack of representations
that effectively model the physical world. In this paper, we introduce ARM4R,
an Auto-regressive Robotic Model that leverages low-level 4D Representations
learned from human video data to yield a better pre-trained robotic model.
Specifically, we focus on utilizing 3D point tracking representations from
videos derived by lifting 2D representations into 3D space via monocular depth
estimation across time. These 4D representations maintain a shared geometric
structure between the points and robot state representations up to a linear
transformation, enabling efficient transfer learning from human video data to
low-level robotic control. Our experiments show that ARM4R can transfer
efficiently from human video data to robotics and consistently improves
performance on tasks across various robot environments and configurations.
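To make the lifting step concrete, below is a minimal sketch, not the authors' implementation, of how 2D point tracks can be back-projected into 3D camera coordinates using per-frame monocular depth and standard pinhole intrinsics, yielding the 3D-points-over-time ("4D") representation the abstract describes. The function name lift_tracks_to_3d, the array shapes, and the nearest-neighbor depth sampling are illustrative assumptions.

```python
# Minimal sketch of lifting 2D point tracks to 3D via monocular depth.
# Assumes a pinhole camera model; all names and shapes are illustrative,
# not taken from the ARM4R codebase.
import numpy as np

def lift_tracks_to_3d(tracks_2d: np.ndarray,  # (T, N, 2) pixel coords (u, v)
                      depths: np.ndarray,     # (T, H, W) per-frame depth maps
                      K: np.ndarray) -> np.ndarray:  # (3, 3) camera intrinsics
    """Back-project 2D point tracks into camera-frame 3D points per frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    T, N, _ = tracks_2d.shape
    points_3d = np.empty((T, N, 3))
    for t in range(T):
        u, v = tracks_2d[t, :, 0], tracks_2d[t, :, 1]
        # Sample depth at each tracked pixel (nearest neighbor for simplicity).
        rows = np.clip(v.astype(int), 0, depths.shape[1] - 1)
        cols = np.clip(u.astype(int), 0, depths.shape[2] - 1)
        z = depths[t, rows, cols]
        # Pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
        points_3d[t, :, 0] = (u - cx) * z / fx
        points_3d[t, :, 1] = (v - cy) * z / fy
        points_3d[t, :, 2] = z
    return points_3d  # (T, N, 3): 3D point tracks across time
```

Because both these 3D point tracks and robot proprioceptive states live in (or map linearly into) the same Euclidean space, a model pre-trained to predict the former can, per the abstract's claim, transfer to predicting low-level robot states up to a linear transformation.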