TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

December 13, 2024
Authors: Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang
cs.AI

Abstract

Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective at handling complex tasks such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and by 3.5x on real-robot tasks, and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment dataset and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
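To make the core idea of visual trace prompting concrete, the sketch below shows one plausible way to render a state-action trajectory onto the current observation before it is passed to a VLA model. This is a minimal illustration, not the paper's released pipeline: the function name `overlay_visual_trace`, the assumption that the trace is a short history of 2D pixel coordinates of a tracked point (e.g. the gripper), and the drawing choices are all our own assumptions for exposition.

```python
# Minimal sketch of visual trace prompting (assumption: the trace is a list of
# 2D pixel coordinates of a tracked point, such as the gripper, over recent
# frames; TraceVLA's actual tracking and annotation details may differ).
from typing import List, Tuple

import numpy as np
from PIL import Image, ImageDraw


def overlay_visual_trace(
    frame: np.ndarray,
    trace: List[Tuple[int, int]],
    color: Tuple[int, int, int] = (255, 0, 0),
    radius: int = 3,
) -> np.ndarray:
    """Draw a state-action trajectory as a polyline on the current RGB frame.

    `trace` is ordered oldest -> newest; the newest point is drawn larger so
    the model can infer the direction of motion.
    """
    img = Image.fromarray(frame)
    draw = ImageDraw.Draw(img)
    if len(trace) >= 2:
        draw.line(trace, fill=color, width=2)           # trajectory polyline
    for i, (x, y) in enumerate(trace):
        r = radius + (2 if i == len(trace) - 1 else 0)  # emphasize latest point
        draw.ellipse([x - r, y - r, x + r, y + r], fill=color)
    return np.asarray(img)


if __name__ == "__main__":
    # Toy example: a blank 256x256 frame with a synthetic gripper trace.
    frame = np.zeros((256, 256, 3), dtype=np.uint8)
    fake_trace = [(40, 200), (70, 180), (100, 150), (130, 130), (160, 120)]
    prompted_frame = overlay_visual_trace(frame, fake_trace)
    # The annotated frame would then be fed to the VLA model together with the
    # language instruction, in place of (or alongside) the raw observation.
    print(prompted_frame.shape)
```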
