OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
March 5, 2025
Authors: Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel
cs.AI
Abstract
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) because visual and language features are fed independently into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes to the policy transformer only the task-relevant visual features that are semantically aligned with the language instruction. This allows OTTER to keep the pre-trained vision-language encoders frozen, preserving and exploiting the rich semantic understanding learned from large-scale pre-training and enabling strong zero-shot generalization. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
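
To make the described mechanism concrete, below is a minimal sketch (not the authors' implementation; see the project page for the real code) of text-aware visual feature extraction: language tokens from a frozen text encoder query visual patch tokens from a frozen image encoder via cross-attention, and only the resulting task-relevant tokens are passed to a policy transformer that predicts actions. All module names, dimensions, and the action head here are illustrative assumptions.

```python
# Sketch of text-aware visual feature extraction feeding a policy transformer.
# Encoder outputs are stand-ins for features from frozen, pre-trained
# vision/language encoders (e.g. CLIP); only the extractor and policy
# modules below would be trained.
import torch
import torch.nn as nn


class TextAwareVisualExtractor(nn.Module):
    """Select task-relevant visual features conditioned on the instruction."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Language tokens act as queries over visual patch tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, num_patches, dim); text_tokens: (B, num_words, dim)
        task_tokens, _ = self.cross_attn(
            query=text_tokens, key=visual_tokens, value=visual_tokens
        )
        return task_tokens  # (B, num_words, dim): text-aware visual features


class PolicyTransformer(nn.Module):
    """Toy policy head over the extracted task-relevant tokens."""

    def __init__(self, dim: int = 512, action_dim: int = 7, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(tokens)
        return self.action_head(h.mean(dim=1))  # (B, action_dim)


if __name__ == "__main__":
    B, P, W, D = 2, 196, 12, 512
    visual_tokens = torch.randn(B, P, D)   # frozen image-encoder patch features
    text_tokens = torch.randn(B, W, D)     # frozen text-encoder token features

    extractor = TextAwareVisualExtractor(dim=D)
    policy = PolicyTransformer(dim=D)

    actions = policy(extractor(visual_tokens, text_tokens))
    print(actions.shape)  # torch.Size([2, 7])
```

Because the gradients only flow through the extractor and policy, the pre-trained vision-language alignment is left untouched, which is the property the abstract credits for zero-shot generalization.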