TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Abstract

Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, which makes them less effective at complex tasks such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach that improves VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA with visual trace prompting on our own collected dataset of 150K robot manipulation trajectories. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance: TraceVLA outperforms OpenVLA by 10% on SimplerEnv and by 3.5x on real-robot tasks, and it generalizes robustly across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on the 4B Phi-3-Vision, pretrained on Open-X-Embodiment and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.

