Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
December 17, 2024
Authors: Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov
cs.AI
Abstract
Diffusion Policies have become widely used in Imitation Learning, offering
several appealing properties, such as generating multimodal and discontinuous
behavior. As models are becoming larger to capture more complex capabilities,
their computational demands increase, as shown by recent scaling laws.
Therefore, continuing with the current architectures will present a
computational roadblock. To address this gap, we propose Mixture-of-Denoising
Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current
state-of-the-art Transformer-based Diffusion Policies while enabling
parameter-efficient scaling through sparse experts and noise-conditioned
routing, reducing active parameters by 40% and inference costs by 90% via
expert caching. Our architecture combines this efficient scaling with a
noise-conditioned self-attention mechanism, enabling more effective denoising
across different noise levels. MoDE achieves state-of-the-art performance on
134 tasks in four established imitation learning benchmarks (CALVIN and
LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01
on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and
Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while
using 90% fewer FLOPs and fewer active parameters compared to default Diffusion
Transformer architectures. Furthermore, we conduct comprehensive ablations on
MoDE's components, providing insights for designing efficient and scalable
Transformer architectures for Diffusion Policies. Code and demonstrations are
available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
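
To illustrate the noise-conditioned routing described above, the sketch below shows how a router could select a sparse set of experts from an embedding of the current noise level. Because the gating depends only on the noise level, the expert choice can be precomputed and cached for each denoising step, which is the basis of the caching-based inference savings the abstract mentions. This is a minimal, hypothetical PyTorch sketch; the class and argument names (NoiseConditionedRouter, noise_emb) are illustrative assumptions, not the actual MoDE implementation.

```python
# Minimal, illustrative sketch of noise-conditioned sparse expert routing.
# Names and structure are assumptions for illustration, not the MoDE codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoiseConditionedRouter(nn.Module):
    """Selects a sparse set of experts based only on the diffusion noise level."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The gate sees only the noise-level embedding, so the routing decision
        # can be computed once per denoising step and cached.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, noise_emb: torch.Tensor):
        # noise_emb: (batch, d_model) embedding of the current noise level.
        logits = self.gate(noise_emb)                       # (batch, num_experts)
        weights, indices = torch.topk(logits, self.top_k)   # keep top-k experts
        weights = F.softmax(weights, dim=-1)                # normalize their weights
        return weights, indices
```

The returned weights and indices would then be used to combine the outputs of only the selected feed-forward experts, which is what keeps the number of active parameters low relative to a dense Diffusion Transformer block.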