Scaling Laws for Native Multimodal Models
April 10, 2025
Authors: Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, Alaaeldin El-Nouby
cs.AI
Abstract
Building general-purpose models that can effectively perceive the world
through multimodal signals has been a long-standing goal. Current approaches
involve integrating separately pre-trained components, such as connecting
vision encoders to LLMs and continuing multimodal training. While such
approaches exhibit remarkable sample efficiency, it remains an open question
whether such late-fusion architectures are inherently superior. In this work,
we revisit the architectural design of native multimodal models (NMMs)--those
trained from the ground up on all modalities--and conduct an extensive scaling
laws study, spanning 457 trained models with different architectures and
training mixtures. Our investigation reveals no inherent advantage to
late-fusion architectures over early-fusion ones, which do not rely on image
encoders. On the contrary, early-fusion exhibits stronger performance at lower
parameter counts, is more efficient to train, and is easier to deploy.
Motivated by the strong performance of the early-fusion architectures, we show
that incorporating Mixture of Experts (MoEs) allows for models that learn
modality-specific weights, significantly enhancing performance.
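As a point of reference for the two architecture families compared in the abstract, the Python sketch below contrasts a late-fusion forward pass (a separately pre-trained vision encoder whose features are projected into an LLM) with an early-fusion one (raw image patches embedded directly alongside text tokens in a single transformer trained from scratch). This is a minimal illustration under assumed module names and dimensions (VisionEncoder, patch_embed, the 1024/2048/768 sizes), not the paper's implementation.

```python
# Minimal sketch (not the paper's code): late-fusion vs. early-fusion forward passes.
# Module names, dimensions, and tokenization details are illustrative assumptions.
import torch
import torch.nn as nn


class LateFusionNMM(nn.Module):
    """Late fusion: a (typically pre-trained) vision encoder produces image
    features that are projected into the LLM's embedding space."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.llm = llm

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        img_feats = self.projector(self.vision_encoder(image))  # (B, N_img, llm_dim)
        tokens = torch.cat([img_feats, text_embeds], dim=1)     # prepend image tokens
        return self.llm(tokens)


class EarlyFusionNMM(nn.Module):
    """Early fusion: no image encoder; raw patches are linearly embedded and
    processed jointly with text tokens by one transformer from the ground up."""
    def __init__(self, transformer: nn.Module, patch_dim: int = 768, dim: int = 2048):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)  # single linear patchifier
        self.transformer = transformer

    def forward(self, image_patches: torch.Tensor, text_embeds: torch.Tensor):
        img_tokens = self.patch_embed(image_patches)             # (B, N_patch, dim)
        tokens = torch.cat([img_tokens, text_embeds], dim=1)
        return self.transformer(tokens)
```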
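The abstract's final point, that Mixture of Experts layers let an early-fusion model learn modality-specific weights, can be illustrated with a small routing sketch. This is only a hedged illustration: the hard modality-based dispatch below (text tokens to one expert, image tokens to another) is an assumption standing in for whatever learned or modality-aware routing the authors actually use, and the expert count and sizes are arbitrary.

```python
# Illustrative sketch of modality-aware expert routing (an assumption, not the
# paper's routing algorithm): tokens are dispatched to per-modality expert FFNs.
import torch
import torch.nn as nn


class ModalityMoEFFN(nn.Module):
    def __init__(self, dim: int = 2048, hidden: int = 8192):
        super().__init__()
        # One expert FFN per modality for clarity; real MoE layers typically use
        # several experts per modality plus a learned, load-balanced router.
        self.experts = nn.ModuleDict({
            "text": nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)),
            "image": nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)),
        })

    def forward(self, tokens: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        """tokens: (B, T, dim); is_image: (B, T) boolean mask marking image tokens."""
        out = torch.empty_like(tokens)
        out[is_image] = self.experts["image"](tokens[is_image])   # image tokens -> image expert
        out[~is_image] = self.experts["text"](tokens[~is_image])  # text tokens  -> text expert
        return out
```

The point of the sketch is only that expert (FFN) parameters can specialize by modality while the attention layers remain shared across modalities; a production MoE would add top-k routing and auxiliary balancing losses.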