CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
December 9, 2024
Authors: Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang
cs.AI
Abstract
In robotic visuomotor policy learning, diffusion-based models have achieved
significant success in improving the accuracy of action trajectory generation
compared to traditional autoregressive models. However, they suffer from
inefficiency due to multiple denoising steps and limited flexibility from
complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive
Policy (CARP), a novel paradigm for visuomotor policy learning that redefines
the autoregressive action generation process as a coarse-to-fine, next-scale
approach. CARP decouples action generation into two stages: first, an action
autoencoder learns multi-scale representations of the entire action sequence;
then, a GPT-style transformer refines the sequence prediction through a
coarse-to-fine autoregressive process. This straightforward and intuitive
approach produces highly accurate and smooth actions, matching or even
surpassing the performance of diffusion-based policies while maintaining
efficiency on par with autoregressive policies. We conduct extensive
evaluations across diverse settings, including single-task and multi-task
scenarios on state-based and image-based simulation benchmarks, as well as
real-world tasks. CARP achieves competitive success rates, with up to a 10%
improvement, and delivers 10x faster inference compared to state-of-the-art
policies, establishing a high-performance, efficient, and flexible paradigm for
action generation in robotic tasks.
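The two-stage idea in the abstract — predict the whole action sequence at a coarse temporal scale, then repeatedly upsample and refine it at finer scales — can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: `predict_residual` is a hypothetical stand-in for the GPT-style transformer, the scale schedule and nearest-neighbor upsampling are illustrative choices, and the action autoencoder is omitted entirely.

```python
import numpy as np

def upsample(seq, length):
    # Nearest-neighbor upsampling of a (T, D) action sequence to `length` steps,
    # carrying the coarse structure forward to the next, finer scale.
    idx = np.floor(np.linspace(0, len(seq) - 1e-9, length)).astype(int)
    return seq[idx]

def coarse_to_fine_generate(predict_residual, scales=(1, 2, 4, 8), action_dim=2):
    # Coarsest scale: predict a refinement of an all-zero initial estimate.
    est = np.zeros((scales[0], action_dim))
    est = est + predict_residual(est)
    # Next-scale loop: upsample the current estimate, then let the (stubbed)
    # autoregressive model refine it conditioned on the coarser prediction.
    for s in scales[1:]:
        up = upsample(est, s)
        est = up + predict_residual(up)
    return est  # final (scales[-1], action_dim) action sequence
```

Each pass touches the entire sequence at once, which is why a next-scale scheme needs only as many model calls as there are scales (four here), rather than one call per action step as in step-by-step autoregression or one call per denoising step as in diffusion.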