

C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing

April 10, 2025
作者: Zhongyang Li, Ziyue Li, Tianyi Zhou
cs.AI

Abstract

Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways: our study reveals that the naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop a novel class of test-time optimization methods to re-weight, or "re-mix," the experts in different layers jointly for each test sample. Since the test sample's ground truth is unknown, we propose to optimize a surrogate objective defined by the sample's "successful neighbors" from a reference set of samples. We introduce three surrogates and algorithms based on mode-finding, kernel regression, and the average loss of similar reference samples/tasks. To reduce the cost of optimizing whole pathways, we apply our algorithms only to the core experts' mixing weights in critical layers, which preserves similar performance while saving significant computation. This leads to "Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO)." We apply C3PO to two recent MoE LLMs and evaluate it on six widely used benchmarks. It consistently improves the base model's accuracy by 7-15% and outperforms widely used test-time learning baselines, e.g., in-context learning and prompt/prefix tuning, by a large margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to outperform LLMs with 7-9B parameters, further improving MoE's efficiency advantage. Our thorough ablation study also offers novel insights into achieving test-time improvement on MoE.
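To make the core idea concrete, below is a minimal, self-contained sketch of test-time expert re-mixing with a kernel-regression surrogate, in the spirit of the abstract. All names (ToyMoELayer, kernel_remix), shapes, and hyperparameters are illustrative assumptions, not the authors' implementation; C3PO additionally restricts the optimization to core experts in critical layers, which this toy single-layer example does not model.

```python
# Illustrative sketch only: a toy MoE layer whose pretrained routing weights can be
# overridden at test time by mixing weights regressed from "successful" reference samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """A toy MoE layer: a router produces mixing weights over a few experts."""
    def __init__(self, d_model=16, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])

    def forward(self, x, mix_override=None):
        # mix_override lets test-time optimization replace the pretrained routing weights.
        mix = F.softmax(self.router(x), dim=-1) if mix_override is None else mix_override
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, d_model, E)
        return (expert_out * mix.unsqueeze(1)).sum(-1)                   # (B, d_model)

def kernel_remix(test_emb, ref_embs, ref_mixes, bandwidth=1.0):
    """Kernel-regression surrogate (assumed form): average the mixing weights of
    successful reference samples, weighted by similarity to the test sample."""
    sims = -torch.cdist(test_emb.unsqueeze(0), ref_embs).squeeze(0) / bandwidth
    w = F.softmax(sims, dim=-1)                       # (N_ref,)
    return (w.unsqueeze(-1) * ref_mixes).sum(0)       # (n_experts,)

# Usage: re-mix a single (assumed critical) layer for one test sample.
layer = ToyMoELayer()
x = torch.randn(1, 16)                                # test-sample hidden state
ref_embs = torch.randn(8, 16)                         # embeddings of successful reference samples
ref_mixes = F.softmax(torch.randn(8, 4), dim=-1)      # their expert mixing weights at this layer
new_mix = kernel_remix(x.squeeze(0), ref_embs, ref_mixes).unsqueeze(0)
out = layer(x, mix_override=new_mix)
```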

