TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
December 13, 2024
Authors: Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang
cs.AI
Abstract
Although large vision-language-action (VLA) models pretrained on extensive
robot datasets offer promising generalist policies for robotic learning, they
still struggle with spatial-temporal dynamics in interactive robotics, making
them less effective in handling complex tasks, such as manipulation. In this
work, we introduce visual trace prompting, a simple yet effective approach to
facilitate VLA models' spatial-temporal awareness for action prediction by
encoding state-action trajectories visually. We develop a new TraceVLA model by
finetuning OpenVLA on our own collected dataset of 150K robot manipulation
trajectories using visual trace prompting. Evaluations of TraceVLA across 137
configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate
state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and
3.5x on real-robot tasks and exhibiting robust generalization across diverse
embodiments and scenarios. To further validate the effectiveness and generality
of our method, we present a compact VLA model based on 4B Phi-3-Vision,
pretrained on Open-X-Embodiment and finetuned on our dataset, which rivals the
7B OpenVLA baseline while significantly improving inference efficiency.
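
As a rough illustration of what "encoding state-action trajectories visually" could look like, the sketch below draws a polyline of past 2D positions onto the current camera frame, so that the history becomes part of the image given to the model. This is not the authors' implementation: the abstract does not say how traces are extracted or rendered. The function name `overlay_trace`, the input format (a list of pixel coordinates, oldest first), and the drawing style are all assumptions made for this example.

```python
import numpy as np


def overlay_trace(frame, points, color=(255, 0, 0), thickness=1):
    """Return a copy of `frame` with a polyline through `points` drawn on it.

    frame:     H x W x 3 uint8 image (the current observation).
    points:    sequence of (x, y) pixel coordinates, oldest first.
               Hypothetical input: where these come from is not
               specified in the abstract.
    thickness: half-width in pixels of the square stamped at each sample.
    """
    out = frame.copy()
    h, w = out.shape[:2]
    pts = np.asarray(points, dtype=float)
    if len(pts) == 0:
        return out
    if len(pts) == 1:
        segments = [(pts[0], pts[0])]
    else:
        segments = list(zip(pts[:-1], pts[1:]))

    for p0, p1 in segments:
        # Sample densely enough that consecutive samples are at most
        # one pixel apart along the longer axis.
        n = int(np.max(np.abs(p1 - p0))) + 1
        for t in np.linspace(0.0, 1.0, n + 1):
            x, y = np.rint(p0 + t * (p1 - p0)).astype(int)
            # Clip the stamped square to the image bounds.
            x0, x1 = max(x - thickness, 0), min(x + thickness + 1, w)
            y0, y1 = max(y - thickness, 0), min(y + thickness + 1, h)
            if x0 < x1 and y0 < y1:
                out[y0:y1, x0:x1] = color
    return out


if __name__ == "__main__":
    frame = np.zeros((64, 64, 3), dtype=np.uint8)
    past_positions = [(5, 5), (30, 20), (50, 50)]  # made-up coordinates
    prompted = overlay_trace(frame, past_positions)
    print(prompted.shape, int((prompted[..., 0] == 255).sum()))
```

Drawing onto a copy keeps the original observation available, so a policy could in principle receive both the raw frame and the trace-annotated one. Whether TraceVLA does so is not stated in the abstract.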