KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

Abstract

With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io.
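
The step from keypoint targets to a planning cost can be illustrated with a minimal sketch. It is not the authors' implementation: the target specification is hard-coded rather than produced by a VLM, `toy_dynamics` is a hypothetical stand-in for a learned neural dynamics model, and random shooting is just one simple sampling-based planner chosen for illustration. All function and parameter names are assumptions.

```python
import random
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Action = Tuple[float, float]


def keypoint_cost(keypoints: List[Point], targets: Dict[int, Point]) -> float:
    """Sum of squared distances between selected keypoints and their targets.

    `targets` maps a keypoint index to a desired 2D position (assumed format).
    """
    return sum(
        (keypoints[i][0] - tx) ** 2 + (keypoints[i][1] - ty) ** 2
        for i, (tx, ty) in targets.items()
    )


def toy_dynamics(keypoints: List[Point], action: Action) -> List[Point]:
    """Hypothetical stand-in for a learned dynamics model.

    A push simply translates every keypoint by the action vector.
    """
    dx, dy = action
    return [(x + dx, y + dy) for x, y in keypoints]


def plan_random_shooting(
    keypoints: List[Point],
    targets: Dict[int, Point],
    dynamics: Callable[[List[Point], Action], List[Point]],
    horizon: int = 5,
    samples: int = 500,
    max_step: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Action], float]:
    """Sample action sequences, roll them out, and keep the cheapest one."""
    rng = random.Random(seed)
    best_actions: List[Action] = []
    best_cost = float("inf")
    for _ in range(samples):
        actions = [
            (rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
            for _ in range(horizon)
        ]
        state = keypoints
        for action in actions:
            state = dynamics(state, action)  # predict the next keypoint positions
        cost = keypoint_cost(state, targets)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


if __name__ == "__main__":
    start = [(0.0, 0.0), (0.1, 0.0)]
    goal = {0: (0.5, 0.3)}  # "move keypoint 0 to (0.5, 0.3)"
    plan, cost = plan_random_shooting(start, goal, toy_dynamics)
    print(f"initial cost: {keypoint_cost(start, goal):.3f}, planned cost: {cost:.3f}")
```

Because the cost depends only on keypoint positions, the same function can score rollouts from any dynamics model that predicts those positions, which is the property the abstract highlights.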
