Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding

Abstract

Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities, including vision, touch, and audio, to fill in gaps from partial observation. For example, when vision is occluded while reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available, by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, including multimodal prompting, compositional cross-modal prompting, and describing the objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe increases success rates by over 20% compared to all considered baselines.
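
To make the combination of the two losses more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a CLIP-style symmetric contrastive loss between pooled observation embeddings and language-instruction embeddings, plus a token-level cross-entropy as the language generation loss. The function names, the temperature, the weighted-sum form, and the `gen_weight` parameter are all hypothetical choices made for illustration; the abstract states only that the two losses are combined.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(obs_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE loss (assumed form): row i of obs_emb matches row i of text_emb."""
    obs = obs_emb / np.linalg.norm(obs_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = obs @ txt.T / temperature          # [batch, batch] cosine similarities
    idx = np.arange(logits.shape[0])
    obs_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_obs = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (obs_to_text + text_to_obs)


def generation_loss(token_logits, target_ids):
    """Mean cross-entropy of predicted description tokens; token_logits is [tokens, vocab]."""
    steps = np.arange(token_logits.shape[0])
    return -log_softmax(token_logits, axis=1)[steps, target_ids].mean()


def combined_auxiliary_loss(obs_emb, text_emb, token_logits, target_ids, gen_weight=1.0):
    """Hypothetical weighted sum of the two auxiliary losses."""
    return contrastive_loss(obs_emb, text_emb) + gen_weight * generation_loss(
        token_logits, target_ids
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.normal(size=(8, 16))                    # stand-in multimodal observation embeddings
    text = obs + 0.01 * rng.normal(size=(8, 16))      # stand-in matching instruction embeddings
    token_logits = rng.normal(size=(5, 32))           # stand-in decoder outputs
    target_ids = rng.integers(0, 32, size=5)          # stand-in description tokens
    print(combined_auxiliary_loss(obs, text, token_logits, target_ids))
```

In this sketch the contrastive term pulls each observation embedding toward the embedding of its own instruction and away from the other instructions in the batch, while the generation term rewards predicting the tokens of a sensory description. In a finetuning setup, auxiliary terms like these would typically be added to the policy's action-prediction loss.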
