Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding
January 8, 2025
作者: Joshua Jones, Oier Mees, Carmelo Sferrazza, Kyle Stachowicz, Pieter Abbeel, Sergey Levine
cs.AI
Abstract
Interacting with the world is a multi-sensory experience: achieving effective
general-purpose interaction requires making use of all available modalities --
including vision, touch, and audio -- to fill in gaps from partial observation.
For example, when vision is occluded reaching into a bag, a robot should rely
on its senses of touch and sound. However, state-of-the-art generalist robot
policies are typically trained on large datasets to predict robot actions
solely from visual and proprioceptive observations. In this work, we propose
FuSe, a novel approach that enables finetuning visuomotor generalist policies
on heterogeneous sensor modalities for which large datasets are not readily
available by leveraging natural language as a common cross-modal grounding. We
combine a multimodal contrastive loss with a sensory-grounded language
generation loss to encode high-level semantics. In the context of robot
manipulation, we show that FuSe enables performing challenging tasks that
require reasoning jointly over modalities such as vision, touch, and sound in a
zero-shot setting, such as multimodal prompting, compositional cross-modal
prompting, and descriptions of objects it interacts with. We show that the same
recipe is applicable to widely different generalist policies, including both
diffusion-based generalist policies and large vision-language-action (VLA)
models. Extensive experiments in the real world show that FuSeis able to
increase success rates by over 20% compared to all considered baselines.Summary
AI-Generated Summary
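The abstract describes combining a multimodal contrastive loss (aligning each sensor modality with language) with a sensory-grounded language generation loss on top of the policy's action objective. The sketch below illustrates one plausible form of such a combined objective — a symmetric InfoNCE term per modality plus a generation negative log-likelihood. All function names, weights, and the specific InfoNCE formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE between two batches of paired embeddings.

    Matched rows of z_a and z_b are positives; all other pairs in the
    batch are negatives. This is an assumed stand-in for the paper's
    multimodal contrastive loss.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    labels = np.arange(len(z_a))

    def xent(l):
        # Numerically stable log-softmax, then pick the positive (diagonal) terms.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

def fuse_loss(action_loss, modality_embs, lang_emb, gen_nll,
              w_contrastive=1.0, w_generation=1.0):
    """Hypothetical FuSe-style total objective: action loss plus
    language-grounded contrastive terms for each sensor modality
    (e.g. touch, audio) plus a language-generation NLL."""
    contrastive = sum(info_nce(e, lang_emb) for e in modality_embs)
    return action_loss + w_contrastive * contrastive + w_generation * gen_nll
```

The key idea this illustrates is that language embeddings act as the shared anchor: every heterogeneous modality is pulled toward the same language space rather than being aligned pairwise with every other modality.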