Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
December 17, 2024
Authors: Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov
cs.AI
Abstract
Diffusion Policies have become widely used in Imitation Learning, offering
several appealing properties, such as generating multimodal and discontinuous
behavior. As models are becoming larger to capture more complex capabilities,
their computational demands increase, as shown by recent scaling laws.
Therefore, continuing with the current architectures will present a
computational roadblock. To address this gap, we propose Mixture-of-Denoising
Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current
state-of-the-art Transformer-based Diffusion Policies while enabling
parameter-efficient scaling through sparse experts and noise-conditioned
routing, reducing both active parameters by 40% and inference costs by 90% via
expert caching. Our architecture combines this efficient scaling with a
noise-conditioned self-attention mechanism, enabling more effective denoising
across different noise levels. MoDE achieves state-of-the-art performance on
134 tasks in four established imitation learning benchmarks (CALVIN and
LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01
on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and
Transformer-based Diffusion Policies by an average of 57% across 4 benchmarks, while
using 90% fewer FLOPs and fewer active parameters compared to default Diffusion
Transformer architectures. Furthermore, we conduct comprehensive ablations on
MoDE's components, providing insights for designing efficient and scalable
Transformer architectures for Diffusion Policies. Code and demonstrations are
available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
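
To make the idea of noise-conditioned routing with expert caching more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the router scores experts from the noise level alone, so the top-k expert choice for each step of a fixed noise schedule can be computed once and reused at inference time. The names `NoiseConditionedRouter`, `build_cache`, and `moe_layer`, the linear scoring in log-noise, and the toy scaling experts are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class NoiseConditionedRouter:
    """Hypothetical router: expert scores depend only on the noise level."""

    def __init__(self, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        # Assumption: one scalar weight and bias per expert, applied to log(sigma).
        self.w = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        self.b = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, sigma):
        """Return [(expert_index, weight), ...] for the top-k experts (sigma > 0)."""
        logits = [w * math.log(sigma) + b for w, b in zip(self.w, self.b)]
        probs = softmax(logits)
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        top = order[: self.top_k]
        norm = sum(probs[i] for i in top)
        # Renormalize so the selected experts' weights sum to 1.
        return [(i, probs[i] / norm) for i in top]


def build_cache(router, sigmas):
    """Precompute the routing decision once per noise level in the schedule."""
    return {sigma: router.route(sigma) for sigma in sigmas}


def moe_layer(x, sigma, experts, cache):
    """Run only the cached experts for this noise level and mix their outputs."""
    out = [0.0] * len(x)
    for idx, weight in cache[sigma]:
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    # Toy experts: expert i scales its input by (i + 1).
    experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
    router = NoiseConditionedRouter(num_experts=4, top_k=2)
    schedule = [80.0, 10.0, 1.0, 0.1]  # assumed noise levels, high to low
    cache = build_cache(router, schedule)
    for sigma in schedule:
        print(sigma, cache[sigma], moe_layer([1.0, 2.0], sigma, experts, cache))
```

In this sketch the router never sees the input tokens, which is what allows `build_cache` to run once before inference. Only `top_k` of the experts execute per noise level, so the number of active parameters per step is smaller than the total parameter count.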