
Scaling Laws for Native Multimodal Models


April 10, 2025
Authors: Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, Alaaeldin El-Nouby
cs.AI

Abstract

Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs)--those trained from the ground up on all modalities--and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on image encoders. On the contrary, early-fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows for models that learn modality-specific weights, significantly enhancing performance.
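To make the two architectural ideas in the abstract concrete, below is a minimal, illustrative sketch (assuming a PyTorch-style implementation; the module names, dimensions, and top-1 routing scheme are choices made for this sketch, not the authors' code). It shows an early-fusion model that embeds raw image patches and text tokens into one shared sequence with no separate vision encoder, and a mixture-of-experts feed-forward layer whose router can learn modality-specific experts.

```python
# Hedged sketch only: not the paper's implementation.
import torch
import torch.nn as nn


class ModalityRoutedMoE(nn.Module):
    """Top-1 MoE feed-forward layer; the router may specialize experts per modality."""

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); route each token to its highest-scoring expert.
        gates = self.router(x).softmax(dim=-1)      # (B, S, num_experts)
        top_gate, top_idx = gates.max(dim=-1)       # (B, S)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                     # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask]) * top_gate[mask].unsqueeze(-1)
        return out


class EarlyFusionBlock(nn.Module):
    """One transformer block over the fused image-patch + text-token sequence."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.moe = ModalityRoutedMoE(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.moe(self.norm2(x))


class TinyNMM(nn.Module):
    """Early fusion: no pretrained vision encoder, only a linear patch embedding."""

    def __init__(self, vocab: int = 32000, dim: int = 512, patch_dim: int = 16 * 16 * 3):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.patch_embed = nn.Linear(patch_dim, dim)  # raw patches -> shared space
        self.blocks = nn.ModuleList([EarlyFusionBlock(dim) for _ in range(2)])
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, patches: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # patches: (B, P, patch_dim) float; tokens: (B, T) int64
        x = torch.cat([self.patch_embed(patches), self.text_embed(tokens)], dim=1)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)  # logits over the fused sequence


if __name__ == "__main__":
    model = TinyNMM()
    logits = model(torch.randn(2, 64, 16 * 16 * 3), torch.randint(0, 32000, (2, 32)))
    print(logits.shape)  # torch.Size([2, 96, 32000])
```

A late-fusion baseline would instead encode the image with a separately pretrained vision encoder and connect its outputs to the language model; the abstract's claim is that, when trained from scratch, the early-fusion design above is at least as strong at lower parameter counts.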

