Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

Abstract

Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as the ability to generate multimodal and discontinuous behavior. As models grow larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Continuing with the current architectures will therefore present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. MoDE surpasses both CNN-based and Transformer-based Diffusion Policies by an average of 57% across the four benchmarks, while using 90% fewer FLOPs and fewer active parameters than default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
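The abstract attributes the inference savings to noise-conditioned routing combined with expert caching. One way to read this is sketched below, under an assumption the abstract does not state: the router sees only the noise level, not the tokens. If so, the experts chosen at each denoising step are fixed once the noise schedule is known, so they can be computed once and reused. All names (`make_router`, `noise_embedding`, `route`, `build_expert_cache`), the sinusoidal embedding, the top-k softmax gating, and the noise levels are hypothetical choices for illustration, not MoDE's actual implementation.

```python
import math
import random


def make_router(num_experts, dim, seed=0):
    """Random linear router: one weight row per expert (illustrative only)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]


def noise_embedding(sigma, dim):
    """Sinusoidal embedding of log(sigma); dim is assumed to be even."""
    log_sigma = math.log(sigma)
    half = dim // 2
    emb = []
    for i in range(half):
        freq = math.exp(-math.log(10000.0) * i / half)
        emb.append(math.sin(log_sigma * freq))
        emb.append(math.cos(log_sigma * freq))
    return emb


def route(router_weights, sigma, top_k):
    """Pick the top-k experts for a noise level and softmax-normalize their gates.

    The router input is the noise embedding alone (an assumption), so the
    result depends on sigma and never on the action or observation tokens.
    """
    emb = noise_embedding(sigma, len(router_weights[0]))
    logits = [sum(w * e for w, e in zip(row, emb)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def build_expert_cache(router_weights, sigmas, top_k):
    """Precompute the routing decision for every noise level in the schedule."""
    return {sigma: route(router_weights, sigma, top_k) for sigma in sigmas}


# Hypothetical 4-expert router and a 4-step noise schedule.
router = make_router(num_experts=4, dim=8)
schedule = [80.0, 10.0, 1.0, 0.1]
cache = build_expert_cache(router, schedule, top_k=2)
for sigma in schedule:
    print(sigma, cache[sigma])
```

With such a cache, each denoising step would only need a dictionary lookup to find its active experts, and the router and the unselected experts would not need to be evaluated at inference time. Whether this accounts for the reported 90% cost reduction is not something the abstract specifies.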
