The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
April 14, 2025
Authors: Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang
cs.AI
Abstract
This paper introduces SAIL, a single transformer unified multimodal large
language model (MLLM) that integrates raw pixel encoding and language decoding
within a singular architecture. Unlike existing modular MLLMs, which rely on a
pre-trained vision transformer (ViT), SAIL eliminates the need for a separate
vision encoder, presenting a more minimalist architecture design. Instead of
introducing novel architectural components, SAIL adapts mix-attention
mechanisms and multimodal positional encodings to better align with the
distinct characteristics of visual and textual modalities. We systematically
compare SAIL's properties, including scalability, cross-modal information flow
patterns, and visual representation capabilities, with those of modular MLLMs.
By scaling both training data and model size, SAIL achieves performance
comparable to modular MLLMs. Notably, the removal of pretrained ViT components
enhances SAIL's scalability and results in significantly different cross-modal
information flow patterns. Moreover, SAIL demonstrates strong visual
representation capabilities, achieving results on par with ViT-22B in vision
tasks such as semantic segmentation. Code and models are available at
https://github.com/bytedance/SAIL.
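The abstract refers to a mix-attention mechanism adapted to the distinct characteristics of visual and textual tokens. The snippet below is a minimal sketch of one common way such a hybrid mask is built (bidirectional attention among image tokens, causal attention for text tokens); it is our own illustration, not the released SAIL implementation, and the token layout (image tokens first, then text) and the helper name build_mixed_attention_mask are assumptions.

```python
# Sketch of a mixed attention mask: image tokens attend to each other
# bidirectionally, while text tokens attend causally (only to earlier tokens).
# Assumes a sequence layout of [image tokens ... | text tokens ...].
import torch

def build_mixed_attention_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Return a boolean mask of shape (L, L); True means 'may attend'."""
    total = num_image_tokens + num_text_tokens
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.ones(total, total).tril().bool()
    # Relax causality inside the image block: full bidirectional attention
    # among image tokens.
    mask[:num_image_tokens, :num_image_tokens] = True
    return mask

# Example: 4 image tokens followed by 3 text tokens.
print(build_mixed_attention_mask(4, 3).int())
```

In this sketch, text tokens still see all image tokens (which precede them in the sequence), so cross-modal information flows from pixels to language while text generation remains autoregressive.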