CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

September 28, 2024
Authors: Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng
cs.AI

Abstract

In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies have identified that the information loss in the CLIP encoding process is substantial, and CLIP tends to capture only coarse-grained features from the input. This deficiency significantly limits the ability of a single CLIP model to handle images rich in visual detail. In this work, we propose a simple yet effective model-agnostic strategy, Diversified Multiplet Upcycling (DMU), for CLIP. DMU efficiently fine-tunes a series of CLIP models from a dense pre-trained CLIP checkpoint, each capturing a different feature space and sharing all parameters except the Feed-Forward Network (FFN). These models can then be transformed into a CLIP-MoE with a larger model capacity, leading to significantly enhanced performance with minimal computational overhead. To the best of our knowledge, Diversified Multiplet Upcycling is the first approach to introduce sparsely activated MoE into CLIP foundation models. Extensive experiments demonstrate the strong performance of CLIP-MoE across various zero-shot retrieval and zero-shot image classification tasks, as well as on downstream Multimodal Large Language Model (MLLM) benchmarks when it serves as a vision encoder. Furthermore, Diversified Multiplet Upcycling enables the conversion of any dense CLIP model into a CLIP-MoE, which can seamlessly replace CLIP in a plug-and-play manner without requiring further adaptation in downstream frameworks. Through Diversified Multiplet Upcycling, we aim to provide valuable insights for future research on developing more efficient and effective multimodal learning systems.
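As a rough illustration of the upcycling step described above (a minimal sketch, not the authors' released code), the PyTorch snippet below shows how the FFNs taken from several diversified fine-tunes of one transformer block might be assembled into a sparsely activated, top-k routed MoE layer while all other parameters stay shared. The names `SparseMoEFFN`, `ffn_variants`, and `top_k` are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    """One transformer FFN slot replaced by a top-k routed mixture of experts.

    Each expert is the FFN from one of the diversified fine-tuned copies;
    every other parameter of the original CLIP model stays shared.
    """

    def __init__(self, ffn_variants, d_model, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(f) for f in ffn_variants)
        self.router = nn.Linear(d_model, len(self.experts))  # learned gate
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, d_model). Route each token to its top-k experts.
        gate_logits = self.router(x)                          # (tokens, E)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)   # sparse choice
        weights = F.softmax(weights, dim=-1)                  # renormalize
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():  # only the selected experts run on each token
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Hypothetical usage: ffn_variants would hold the FFN modules extracted
# from the diversified fine-tunes of one transformer block.
# moe_ffn = SparseMoEFFN(ffn_variants, d_model=768, top_k=2)
```

With a small top-k (e.g. 2), per-token compute stays close to that of a single dense FFN even as total model capacity grows with the number of experts, which is what makes the sparsely activated design cheap at inference time.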

