CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
December 9, 2024
Authors: Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang
cs.AI
Abstract
In robotic visuomotor policy learning, diffusion-based models have achieved
significant success in improving the accuracy of action trajectory generation
compared to traditional autoregressive models. However, they suffer from
inefficiency due to multiple denoising steps and limited flexibility from
complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive
Policy (CARP), a novel paradigm for visuomotor policy learning that redefines
the autoregressive action generation process as a coarse-to-fine, next-scale
approach. CARP decouples action generation into two stages: first, an action
autoencoder learns multi-scale representations of the entire action sequence;
then, a GPT-style transformer refines the sequence prediction through a
coarse-to-fine autoregressive process. This straightforward and intuitive
approach produces highly accurate and smooth actions, matching or even
surpassing the performance of diffusion-based policies while maintaining
efficiency on par with autoregressive policies. We conduct extensive
evaluations across diverse settings, including single-task and multi-task
scenarios on state-based and image-based simulation benchmarks, as well as
real-world tasks. CARP achieves competitive success rates, with up to a 10%
improvement, and delivers 10x faster inference compared to state-of-the-art
policies, establishing a high-performance, efficient, and flexible paradigm for
action generation in robotic tasks.
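To make the abstract's two-stage pipeline concrete, below is a minimal PyTorch sketch of the coarse-to-fine idea: a residual multi-scale action tokenizer (stage one) and a GPT-style transformer that predicts action tokens one scale at a time (stage two). All names, shapes, and hyperparameters here (MultiScaleActionAutoencoder, CoarseToFineTransformer, scales=(1, 2, 4, 8), latent_dim, and so on) are illustrative assumptions, not the authors' implementation; training losses, the observation encoder, and attention masking are omitted.

```python
# Hypothetical sketch of CARP-style coarse-to-fine action generation.
# Not the paper's code: module names, shapes, and scales are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleActionAutoencoder(nn.Module):
    """Stage 1 (sketch): tokenize an action chunk at several temporal scales.

    Each scale quantizes the residual left unexplained by coarser scales,
    in the spirit of residual multi-scale VQ. Actions: (B, T, action_dim).
    """
    def __init__(self, action_dim=7, latent_dim=64, codebook_size=512,
                 scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.encoder = nn.Conv1d(action_dim, latent_dim, kernel_size=3, padding=1)
        self.decoder = nn.Conv1d(latent_dim, action_dim, kernel_size=3, padding=1)
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def quantize(self, z):
        # Nearest codebook entry per position; z: (B, s, latent_dim).
        flat = z.reshape(-1, z.size(-1))                          # (B*s, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        idx = idx.view(z.size(0), z.size(1))                      # (B, s)
        return idx, self.codebook(idx)                            # tokens, embeddings

    def encode(self, actions):
        # Returns token maps from the coarsest scale (length 1) to the finest.
        z = self.encoder(actions.transpose(1, 2)).transpose(1, 2)  # (B, T, D)
        residual, token_maps = z, []
        for s in self.scales:
            zs = F.adaptive_avg_pool1d(residual.transpose(1, 2), s).transpose(1, 2)
            idx, q = self.quantize(zs)
            token_maps.append(idx)
            # Upsample quantized codes back to T; subtract the explained part.
            up = F.interpolate(q.transpose(1, 2), size=z.size(1)).transpose(1, 2)
            residual = residual - up
        return token_maps

    def decode(self, token_maps, horizon):
        # Sum upsampled code embeddings of all scales, then decode to actions.
        z = 0
        for idx in token_maps:
            q = self.codebook(idx)                                # (B, s, D)
            z = z + F.interpolate(q.transpose(1, 2), size=horizon).transpose(1, 2)
        return self.decoder(z.transpose(1, 2)).transpose(1, 2)    # (B, T, action_dim)

class CoarseToFineTransformer(nn.Module):
    """Stage 2 (sketch): GPT-style next-scale prediction.

    Conditioned on an observation embedding, it emits all tokens of scale 1,
    then scale 2, and so on: one forward pass per scale, not per timestep.
    """
    def __init__(self, latent_dim=64, codebook_size=512, n_layers=4, n_heads=4):
        super().__init__()
        self.tok = nn.Embedding(codebook_size, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(latent_dim, codebook_size)

    @torch.no_grad()
    def generate(self, obs_emb, scales=(1, 2, 4, 8)):
        # obs_emb: (B, 1, latent_dim). Block-causal masking is omitted here.
        ctx, token_maps = obs_emb, []
        for s in scales:
            h = self.blocks(ctx)                                  # (B, L, D)
            # Query the next scale by repeating the last hidden state to
            # length s (a simplification of the paper's scale conditioning).
            q = h[:, -1:, :].expand(-1, s, -1)
            idx = self.head(q).argmax(dim=-1)                     # (B, s)
            token_maps.append(idx)
            ctx = torch.cat([ctx, self.tok(idx)], dim=1)
        return token_maps

# Usage sketch: generate multi-scale tokens from an observation embedding
# (e.g. a vision encoder output), then decode them into an action chunk.
ae = MultiScaleActionAutoencoder()
policy = CoarseToFineTransformer()
obs_emb = torch.randn(2, 1, 64)
actions = ae.decode(policy.generate(obs_emb), horizon=16)         # (2, 16, 7)
```

Under these assumptions, the efficiency claim in the abstract follows from the structure itself: inference costs one transformer pass per scale (four in this sketch) rather than one pass per timestep for token-level autoregression or tens of denoising passes for a diffusion policy.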