
Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding

January 8, 2025
Authors: Joshua Jones, Oier Mees, Carmelo Sferrazza, Kyle Stachowicz, Pieter Abbeel, Sergey Levine
cs.AI

Abstract

Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and descriptions of objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
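The abstract describes the training objective only at a high level: a multimodal contrastive loss combined with a sensory-grounded language generation loss. As a rough illustration, the sketch below shows what such a combined auxiliary objective could look like in JAX. All function names, the `temperature`, and the `alpha` weight are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not the FuSe codebase): a symmetric CLIP-style
# contrastive loss aligning sensor and language embeddings, plus a token-level
# cross-entropy loss for language generated from sensory context.
import jax.numpy as jnp
import optax


def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (sensor, text) embedding pairs."""
    # Normalize so the dot product is a cosine similarity.
    sensor_emb = sensor_emb / jnp.linalg.norm(sensor_emb, axis=-1, keepdims=True)
    text_emb = text_emb / jnp.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = sensor_emb @ text_emb.T / temperature  # (B, B); matches on the diagonal
    labels = jnp.arange(logits.shape[0])
    loss_s2t = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    loss_t2s = optax.softmax_cross_entropy_with_integer_labels(logits.T, labels)
    return (loss_s2t + loss_t2s).mean() / 2


def generation_loss(token_logits, token_targets):
    """Cross-entropy over language tokens describing the sensory observation."""
    return optax.softmax_cross_entropy_with_integer_labels(
        token_logits, token_targets
    ).mean()


def fuse_auxiliary_loss(sensor_emb, text_emb, token_logits, token_targets, alpha=0.5):
    """Weighted sum of the two objectives; the weighting scheme is assumed."""
    return contrastive_loss(sensor_emb, text_emb) + alpha * generation_loss(
        token_logits, token_targets
    )
```

In this reading, the contrastive term pulls embeddings of each sensor modality toward the embedding of the matching language description, while the generation term forces the policy backbone to produce those descriptions itself, grounding the new modalities in the same language space the pretrained policy already understands.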

