Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset

Abstract

Pre-training visual representations has improved the efficiency of robot learning. Because large-scale in-domain robotic datasets have been lacking, prior works use in-the-wild human videos to pre-train robotic visual representations. Despite their promising results, representations learned from human videos are inevitably subject to distribution shifts and lack the dynamics information crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation with downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that manipulation centricity is a strong indicator of success rates when the representations are applied to downstream tasks. Drawing on these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework that captures both visual features and the dynamics information of manipulation tasks, such as actions and proprioceptions, to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with a behavior cloning (BC)-like actor loss that predicts actions during pre-training, along with a time contrastive loss. Empirical results across 4 simulation domains with 20 tasks verify that MCR outperforms the strongest baseline method by 14.8%. Moreover, MCR boosts the performance of data-efficient learning with a UR5e arm on 3 real-world tasks by 76.9%. Project website: https://robots-pretrain-robots.github.io/.
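
The three pre-training terms named in the abstract (a contrastive loss aligning visual observations with state-action dynamics, a BC-like actor loss, and a time contrastive loss) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an InfoNCE-style form for both contrastive terms, a mean-squared-error actor loss, and a plain weighted sum. The function names, the temperature, and the weights `w_dyn`, `w_bc`, and `w_time` are hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE-style loss: queries[i] should match keys[i];
    every other key in the batch acts as a negative."""
    q = [normalize(x) for x in queries]
    k = [normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def bc_loss(pred_actions, expert_actions):
    """Mean squared error between predicted and demonstrated actions."""
    count = sum(len(a) for a in expert_actions)
    squared = sum(
        (p - e) ** 2
        for pa, ea in zip(pred_actions, expert_actions)
        for p, e in zip(pa, ea)
    )
    return squared / count


def combined_loss(img_emb, dyn_emb, nearby_img_emb, pred_actions,
                  expert_actions, w_dyn=1.0, w_bc=1.0, w_time=1.0):
    """Weighted sum of the three terms (weights are assumptions)."""
    # Align each image embedding with its state-action dynamics embedding.
    l_dyn = info_nce(img_emb, dyn_emb)
    # Actor term: predict the demonstrated action.
    l_bc = bc_loss(pred_actions, expert_actions)
    # Time term: a temporally nearby frame is the positive (assumed form).
    l_time = info_nce(img_emb, nearby_img_emb)
    return w_dyn * l_dyn + w_bc * l_bc + w_time * l_time
```

In this sketch, correctly paired image and dynamics embeddings give a loss near zero, while mismatched pairs are penalized heavily. That is the intended effect of a contrastive alignment term, whatever the exact formulation used in MCR.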
