

C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing

April 10, 2025
Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
cs.AI

Abstract

Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways: our study reveals that the naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop a novel class of test-time optimization methods to jointly re-weight or "re-mix" the experts in different layers for each test sample. Since the test sample's ground truth is unknown, we propose to optimize a surrogate objective defined by the sample's "successful neighbors" from a reference set of samples. We introduce three surrogates and algorithms based on mode-finding, kernel regression, and the average loss of similar reference samples/tasks. To reduce the cost of optimizing whole pathways, we apply our algorithms only to the core experts' mixing weights in critical layers, which achieves similar performance while saving significant computation. This leads to "Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO)". We apply C3PO to two recent MoE LLMs and evaluate it on six widely used benchmarks. It consistently improves the base models' accuracy by 7-15% and outperforms widely used test-time learning baselines, e.g., in-context learning and prompt/prefix tuning, by a large margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to outperform LLMs of 7-9B parameters, further strengthening MoE's efficiency advantage. Our thorough ablation study also offers novel insights into achieving test-time improvement on MoE.
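
To make the re-mixing idea concrete, the minimal sketch below shows how a test sample's core-expert mixing weights in a few critical layers could be estimated from the recorded pathways of "successful" reference samples via kernel regression. The function and variable names (remix_core_experts, ref_pathways, etc.), the Gaussian kernel, and the toy shapes are illustrative assumptions, not the paper's actual implementation, which optimizes a surrogate loss over neighbors rather than simply averaging their pathways.

```python
import numpy as np

def kernel_weights(test_emb, ref_embs, bandwidth=1.0):
    """Gaussian-kernel similarities between one test embedding and reference embeddings."""
    sq_dists = np.sum((ref_embs - test_emb) ** 2, axis=1)
    k = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return k / (k.sum() + 1e-12)

def remix_core_experts(test_emb, ref_embs, ref_pathways, ref_success, bandwidth=1.0):
    """Estimate a test sample's core-expert mixing weights in the critical layers
    as a kernel-weighted average of the pathways of successful reference samples.

    ref_pathways: (n_refs, n_critical_layers, n_core_experts) recorded mixing weights.
    ref_success:  (n_refs,) boolean mask of reference samples the model answered correctly.
    """
    mask = np.asarray(ref_success, dtype=bool)
    k = kernel_weights(test_emb, ref_embs[mask], bandwidth)    # (n_successful,)
    mixed = np.tensordot(k, ref_pathways[mask], axes=1)        # (n_critical_layers, n_core_experts)
    # Renormalize so the core-expert weights of each critical layer sum to 1.
    return mixed / (mixed.sum(axis=-1, keepdims=True) + 1e-12)

# Toy usage with random data; real embeddings and pathways would come from the MoE model's router.
rng = np.random.default_rng(0)
ref_embs = rng.normal(size=(50, 16))                           # 50 reference samples, 16-dim embeddings
ref_pathways = rng.dirichlet(np.ones(4), size=(50, 3))         # 3 critical layers x 4 core experts each
ref_success = rng.random(50) > 0.4                             # which reference samples were solved
test_emb = rng.normal(size=16)
print(remix_core_experts(test_emb, ref_embs, ref_pathways, ref_success).round(3))
```

Restricting the update to a handful of critical layers and core experts, as in the sketch, is what keeps the per-sample optimization cheap relative to re-mixing every layer's full expert set.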

