CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

Abstract

In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.
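To make the coarse-to-fine, next-scale idea concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a fixed (non-learned) multi-scale residual decomposition using linear resampling in place of CARP's learned action autoencoder, and it omits the GPT-style transformer entirely. The helper names `resize`, `encode_multiscale`, and `decode_coarse_to_fine`, and the scale schedule `(1, 2, 4, 8)`, are hypothetical choices made for illustration.

```python
import numpy as np


def resize(seq: np.ndarray, length: int) -> np.ndarray:
    """Linearly resample a (T, D) sequence to shape (length, D)."""
    t_old = np.linspace(0.0, 1.0, num=seq.shape[0])
    t_new = np.linspace(0.0, 1.0, num=length)
    cols = [np.interp(t_new, t_old, seq[:, d]) for d in range(seq.shape[1])]
    return np.stack(cols, axis=1)


def encode_multiscale(actions: np.ndarray, scales) -> list:
    """Split an action sequence into residual maps, coarsest scale first.

    Assumption: a hand-written stand-in for a learned multi-scale autoencoder.
    """
    horizon = actions.shape[0]
    residual = actions.astype(float).copy()
    maps = []
    for k in scales:
        coarse = resize(residual, k)  # summarize what is still unexplained
        maps.append(coarse)
        residual = residual - resize(coarse, horizon)  # leave detail for finer scales
    return maps


def decode_coarse_to_fine(maps: list, horizon: int) -> list:
    """Accumulate upsampled maps; returns the running estimate after each scale."""
    estimate = np.zeros((horizon, maps[0].shape[1]))
    partial = []
    for m in maps:
        estimate = estimate + resize(m, horizon)
        partial.append(estimate.copy())
    return partial


if __name__ == "__main__":
    # Hypothetical 8-step, 2-dimensional action trajectory.
    t = np.linspace(0.0, 1.0, 8)
    actions = np.stack([np.sin(2 * np.pi * t), t ** 2], axis=1)
    maps = encode_multiscale(actions, scales=(1, 2, 4, 8))
    for scale_map, est in zip(maps, decode_coarse_to_fine(maps, horizon=8)):
        err = np.abs(est - actions).max()
        print(f"scale {scale_map.shape[0]}: max abs error {err:.4f}")
```

In this toy setup the whole sequence is refined in four passes (one per scale) rather than one pass per time step, which is the intuition behind next-scale prediction. In a policy of the kind the abstract describes, each scale's map would be predicted by a model conditioned on observations and on the coarser scales, rather than computed from a known trajectory as it is here.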
