

Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

March 4, 2025
Authors: Yujiao Yang, Jing Lian, Linhui Li
cs.AI

Abstract

Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, each expert in existing MoE paradigms works as an individual, lacking high-quality expert interactions. Moreover, MoE has not been effectively extended to the attention block, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the Transformer into an equivalent group of experts and then implements dynamic routing over input data and experts. Our approach advances MoE design with four key innovations: (1) We perform equivalent expert decomposition on both MLP blocks and attention blocks, based on matrix partitioning in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing at different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and the Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize efficiency based on hardware processing analysis. Experiments demonstrate that models employing UoE surpass Full Attention, state-of-the-art MoEs, and efficient Transformers on several tasks across the image and natural language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.
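
To make points (1) and (2) above more concrete, below is a minimal PyTorch sketch, not the authors' released implementation (see the repository linked above). It assumes a dense MLP whose weight matrices are column- and row-partitioned into expert slices, as in tensor parallelism, so that summing all expert outputs recovers the original MLP, and it adds a simple top-k expert-selection router. All module names, shapes, and the gating scheme are illustrative assumptions.

```python
# Minimal sketch (illustrative only, not the UoE implementation): it shows
# (a) splitting a dense MLP into "equivalent" experts by partitioning its weight
# matrices as in tensor parallelism, and (b) routing each token/patch to a
# top-k subset of those experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecomposedMLPUnion(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        assert d_hidden % n_experts == 0
        self.n_experts, self.k = n_experts, k
        d_slice = d_hidden // n_experts
        # Column-partition W1 and row-partition W2: because the activation is
        # elementwise, summing all expert outputs reproduces the dense MLP
        # (the "equivalent decomposition").
        self.w1 = nn.Parameter(torch.randn(n_experts, d_model, d_slice) / d_model**0.5)
        self.w2 = nn.Parameter(torch.randn(n_experts, d_slice, d_model) / d_slice**0.5)
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)      # routing weights per token
        topv, topi = scores.topk(self.k, dim=-1)        # expert selection (top-k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topi[..., slot]                       # (batch, tokens) expert ids
            w1 = self.w1[idx]                           # gather expert weight slices
            w2 = self.w2[idx]
            h = F.gelu(torch.einsum("btd,btdh->bth", x, w1))
            out = out + topv[..., slot:slot + 1] * torch.einsum("bth,bthd->btd", h, w2)
        return out


if __name__ == "__main__":
    layer = DecomposedMLPUnion(d_model=64, d_hidden=256, n_experts=4, k=2)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])
```

With uniform unit gates over all experts this construction reduces exactly to the original dense MLP; restricting computation to the top-k experts per token is where the sparse-compute benefit described in the abstract comes from.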
