A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
March 10, 2025
Authors: Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
cs.AI
Abstract
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet
their optimal configuration remains unclear. Through systematic evaluation, we
find that while DINO and iBOT outperform MAE across visuomotor control and
perception tasks, they struggle when trained on non-(single-)object-centric
(NOC) data, a limitation strongly correlated with their diminished ability to
learn object-centric representations. This investigation indicates that the
ability to form object-centric representations from non-object-centric
robotics datasets is key to the success of PVMs. Motivated by this discovery,
we design SlotMIM, a method that induces object-centric representations by
introducing a semantic bottleneck that reduces the number of prototypes,
encouraging the emergence of objectness, together with cross-view consistency
regularization that promotes multi-view invariance. Our experiments encompass
pre-training on object-centric, scene-centric, web-crawled, and ego-centric
data. Across all settings, our approach learns transferable representations
and achieves significant improvements over prior work in image recognition,
scene understanding, and robot learning evaluations. When scaled up with
million-scale datasets, our method also demonstrates superior data efficiency
and scalability. Our code and models are publicly available at
https://github.com/CVMI-Lab/SlotMIM.
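The abstract names two mechanisms behind SlotMIM: a semantic bottleneck that keeps the number of prototypes small, and a cross-view consistency regularizer. Below is a minimal PyTorch-style sketch of how such a pair of components could fit together. It is an illustration under assumptions, not the authors' implementation (see the linked repository for that): the module and function names, the prototype count, the temperature, and the simplification that patch tokens from the two views are spatially aligned are all hypothetical.

# Hypothetical sketch of the two ideas named in the abstract; not SlotMIM's
# actual code. Assumes a ViT-style encoder has already produced patch tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeBottleneck(nn.Module):
    def __init__(self, feat_dim=256, num_prototypes=64, temp=0.1):
        super().__init__()
        # Deliberately few prototypes: the "semantic bottleneck" intended to
        # encourage object-level grouping of patch features.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))
        self.temp = temp

    def forward(self, patch_feats):
        # patch_feats: (B, N, D) patch tokens from one augmented view.
        feats = F.normalize(patch_feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        # Cosine-similarity logits over prototypes: (B, N, K).
        return feats @ protos.t() / self.temp

def cross_view_consistency(logits_a, logits_b):
    # Encourage corresponding patches in two augmented views to receive the
    # same prototype assignment (multi-view invariance). Each view provides
    # stop-gradient soft targets for the other. Simplification: assumes the
    # two views' patches are spatially aligned; a real pipeline would match
    # only overlapping regions.
    targets_b = F.softmax(logits_b, dim=-1).detach()
    loss_ab = -(targets_b * F.log_softmax(logits_a, dim=-1)).sum(-1).mean()
    targets_a = F.softmax(logits_a, dim=-1).detach()
    loss_ba = -(targets_a * F.log_softmax(logits_b, dim=-1)).sum(-1).mean()
    return 0.5 * (loss_ab + loss_ba)

In use, one would encode two augmented views of the same frame, pass both sets of patch tokens through the shared bottleneck, and minimize cross_view_consistency(bottleneck(view_a), bottleneck(view_b)) alongside the masked-image-modeling objective implied by the method's name.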