C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
April 10, 2025
作者: Zhongyang Li, Ziyue Li, Tianyi Zhou
cs.AI
Abstract
Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways: our study reveals that the naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop a novel class of test-time optimization methods to re-weight or "re-mix" the experts in different layers jointly for each test sample. Since the test sample's ground truth is unknown, we propose to optimize a surrogate objective defined by the sample's "successful neighbors" from a reference set of samples. We introduce three surrogates and algorithms based on mode-finding, kernel regression, and the average loss of similar reference samples/tasks. To reduce the cost of optimizing whole pathways, we apply our algorithms only to the core experts' mixing weights in critical layers, which achieves similar performance while saving significant computation. This leads to "Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO)". We apply C3PO to two recent MoE LLMs and evaluate it on six widely used benchmarks. It consistently improves the base model's accuracy by 7-15% and outperforms widely used test-time learning baselines, e.g., in-context learning and prompt/prefix tuning, by a large margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to outperform LLMs of 7-9B parameters, hence improving MoE's advantage in efficiency. Our thorough ablation study further offers novel insights into achieving test-time improvement on MoE.
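
To illustrate the kernel-regression surrogate mentioned above, the sketch below estimates a test sample's mixing weights for core experts in critical layers as a kernel-weighted average of the expert pathways recorded for "successful neighbors" in a reference set. This is a minimal, hypothetical reconstruction, not the authors' implementation: the function names, array shapes, and the choice of a Gaussian kernel are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): test-time re-mixing of core-expert
# weights in critical layers via a kernel-regression surrogate over a
# reference set of successful neighbors. All names and shapes are illustrative.
import numpy as np

def kernel_weights(test_emb, ref_embs, bandwidth=1.0):
    """Gaussian-kernel similarities between the test sample and reference samples."""
    d2 = np.sum((ref_embs - test_emb) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return w / (w.sum() + 1e-12)

def surrogate_mixing_weights(test_emb, ref_embs, ref_pathways,
                             critical_layers, core_experts):
    """
    Estimate per-layer expert mixing weights for a test sample as a
    kernel-weighted average of the pathways of successful reference neighbors,
    restricted to core experts in critical layers.

    ref_pathways: array of shape (num_refs, num_layers, num_experts) holding the
                  mixing weights recorded for correctly solved reference samples.
    Returns a dict {layer_index: mixing weights over all experts}, leaving the
    router's original weights untouched in all other layers.
    """
    w = kernel_weights(test_emb, ref_embs)               # (num_refs,)
    pathway = np.einsum("r,rle->le", w, ref_pathways)    # (num_layers, num_experts)

    updates = {}
    for layer in critical_layers:
        mix = np.zeros(pathway.shape[1])
        mix[core_experts] = pathway[layer, core_experts]
        mix /= mix.sum() + 1e-12                          # renormalize over core experts
        updates[layer] = mix
    return updates

# Example usage with toy shapes (hypothetical sizes).
rng = np.random.default_rng(0)
test_emb = rng.normal(size=16)
ref_embs = rng.normal(size=(32, 16))
ref_pathways = rng.random(size=(32, 24, 8))   # 32 refs, 24 layers, 8 experts
updates = surrogate_mixing_weights(test_emb, ref_embs, ref_pathways,
                                   critical_layers=[20, 21, 22],
                                   core_experts=[0, 3, 5])
```

In this sketch the surrogate replaces the router's output only in the specified critical layers and only for the listed core experts, which mirrors the paper's motivation of optimizing a small part of the pathway to keep the extra test-time computation modest.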