Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

Abstract

Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models grow larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Continuing with current architectures will therefore present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing both active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across the four benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
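To make the idea of noise-conditioned routing with expert caching more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the router sees only the noise level and never the token content, so the top-k expert choice for every step of a fixed denoising schedule can be computed once and reused. All names (`NoiseRouter`, `build_expert_cache`, `moe_layer`), the sinusoidal embedding, the toy experts, and the schedule values are hypothetical and chosen for illustration only.

```python
import math
import random


def noise_embedding(sigma, dim=8):
    """Sinusoidal embedding of a scalar noise level (an illustrative choice)."""
    log_sigma = math.log(sigma)
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return ([math.sin(log_sigma * f) for f in freqs]
            + [math.cos(log_sigma * f) for f in freqs])


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class NoiseRouter:
    """Hypothetical router: expert choice depends on the noise level only."""

    def __init__(self, num_experts=4, top_k=2, dim=8, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.dim = dim
        # Random weights stand in for a learned linear routing layer.
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, sigma):
        emb = noise_embedding(sigma, self.dim)
        logits = [sum(w * e for w, e in zip(row, emb)) for row in self.weights]
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda i: probs[i],
                     reverse=True)[:self.top_k]
        norm = sum(probs[i] for i in top)
        # Return (expert index, renormalized gate weight) pairs.
        return [(i, probs[i] / norm) for i in top]


def build_expert_cache(router, sigmas):
    """Precompute the routing decision for every noise level in the schedule."""
    return {sigma: router.route(sigma) for sigma in sigmas}


def moe_layer(x, sigma, experts, cache):
    """Run only the cached experts for this noise level on the vector x."""
    out = [0.0] * len(x)
    for idx, gate in cache[sigma]:
        y = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Toy experts: each one scales its input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (0.5, 1.0, 1.5, 2.0)]
router = NoiseRouter(num_experts=len(experts), top_k=2)
schedule = [80.0, 10.0, 1.0, 0.1, 0.01]  # hypothetical denoising schedule
cache = build_expert_cache(router, schedule)
result = moe_layer([1.0, -2.0], schedule[0], experts, cache)
```

Under this assumption, the router does no work at inference time: each denoising step looks up its expert indices and gate weights in `cache`, and only `top_k` of the experts are evaluated per step. The savings reported in the abstract come from the authors' own architecture and measurements, not from this toy example.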
