R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
February 27, 2025
Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
cs.AI
Abstract
In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) usually falls short of the powerful reasoning capabilities of large language models (LLMs), limiting LMMs' performance on challenging downstream tasks. This weakness has recently been mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides the rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of a multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce optimal routing weights for every test sample. To bridge this gap, we propose a novel and efficient method, "Re-Routing in Test-Time" (R2-T2), which locally optimizes the vector of routing weights at test time by moving it toward the vectors of correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and substantially improves state-of-the-art LMMs' performance on challenging benchmarks across diverse tasks, without training any base-model parameters.
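
The abstract does not spell out the update rule, but the core mechanism (shifting a test sample's routing weights toward those of nearby correctly predicted samples, then mixing expert representations with the new weights) can be illustrated with a short sketch. This is a minimal illustration under assumptions stated in the comments, not the paper's actual algorithm; the function and parameter names (`rerout_weights`, `alpha`, `tau`) are hypothetical, and the paper's three strategies differ in optimization objective and neighbor-search space.

```python
# Minimal sketch of test-time re-routing in the spirit of R2-T2.
# Assumption (not from the paper): we keep a reference set of correctly
# predicted samples along with their embeddings and routing weights.
import numpy as np

def rerout_weights(w_router, test_emb, ref_embs, ref_weights,
                   k=5, alpha=0.5, tau=1.0):
    """Shift the router's weight vector toward those of nearby correct samples.

    w_router:    (m,) routing weights over m experts from the trained router
    test_emb:    (d,) embedding of the test sample
    ref_embs:    (n, d) embeddings of correctly predicted reference samples
    ref_weights: (n, m) routing weights the router used for those samples
    """
    # k nearest reference samples in embedding space
    dists = np.linalg.norm(ref_embs - test_emb, axis=1)
    idx = np.argsort(dists)[:k]

    # kernel-weighted average of the neighbors' routing weights
    sim = np.exp(-dists[idx] / tau)
    target = sim @ ref_weights[idx] / sim.sum()

    # move the router's prediction toward the neighborhood target
    new_w = (1 - alpha) * w_router + alpha * target
    return new_w / new_w.sum()  # renormalize to a valid mixture

# The re-routed weights then mix the experts' representations as usual:
# mixed = new_w @ expert_reps  # expert_reps: (m, d_repr)
```

Since only the routing-weight vector is adjusted per test sample, no base-model parameter is updated, which matches the abstract's claim that R2-T2 requires no training of the underlying LMM.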