Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset

Abstract

Pre-training visual representations has improved the efficiency of robot learning. Because large-scale in-domain robotic datasets have been scarce, prior work pre-trains robotic visual representations on in-the-wild human videos. Despite promising results, representations learned from human videos are inevitably subject to distribution shift and lack the dynamics information that is crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation with downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that manipulation centricity is a strong indicator of success rates on downstream tasks. Drawing on these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework that captures both visual features and dynamics information, such as the actions and proprioceptive states of manipulation tasks, to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, and combine it with a behavior cloning (BC)-like actor loss that predicts actions during pre-training and with a time contrastive loss. Empirical results across 4 simulation domains with 20 tasks verify that MCR outperforms the strongest baseline method by 14.8%. Moreover, MCR boosts the performance of data-efficient learning with a UR5e arm on 3 real-world tasks by 76.9%.

Project website: https://robots-pretrain-robots.github.io/.
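
To make the three pre-training terms more concrete, the sketch below shows one way such an objective could be assembled. It is an illustration only, not the authors' implementation: the abstract does not give the exact form of the losses, so the InfoNCE formulation, the mean-squared-error actor loss, the equal default weights, the temperature, and every function and parameter name (`info_nce`, `bc_loss`, `pretraining_loss`, `w_dyn`, `w_bc`, `w_time`) are assumptions. Embeddings are plain Python lists, standing in for the outputs of a visual encoder and a hypothetical state-action dynamics encoder.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchors[i] should match positives[i]; the other
    rows of `positives` in the batch act as negatives (assumed form)."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def bc_loss(pred_actions, expert_actions):
    """BC-like actor loss, assumed here to be mean squared error."""
    count = sum(len(e) for e in expert_actions)
    squared = sum(
        (p - e) ** 2
        for pa, ea in zip(pred_actions, expert_actions)
        for p, e in zip(pa, ea)
    )
    return squared / count


def pretraining_loss(img_emb, dyn_emb, nearby_img_emb,
                     pred_actions, expert_actions,
                     w_dyn=1.0, w_bc=1.0, w_time=1.0):
    """Weighted sum of the three terms named in the abstract.
    The weights and the exact form of each term are assumptions."""
    dynamics_term = info_nce(img_emb, dyn_emb)      # image vs. state-action dynamics
    actor_term = bc_loss(pred_actions, expert_actions)
    time_term = info_nce(img_emb, nearby_img_emb)   # frame vs. temporally nearby frame
    return w_dyn * dynamics_term + w_bc * actor_term + w_time * time_term
```

In this sketch, the dynamics term is small when each image embedding is closest to the dynamics embedding of its own trajectory segment and grows when it is closer to another segment's, which is the alignment behavior the abstract describes.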
