The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
April 14, 2025
Authors: Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang
cs.AI
Abstract
This paper introduces SAIL, a single transformer unified multimodal large
language model (MLLM) that integrates raw pixel encoding and language decoding
within a singular architecture. Unlike existing modular MLLMs, which rely on a
pre-trained vision transformer (ViT), SAIL eliminates the need for a separate
vision encoder, presenting a more minimalist architecture design. Instead of
introducing novel architectural components, SAIL adapts mix-attention
mechanisms and multimodal positional encodings to better align with the
distinct characteristics of visual and textual modalities. We systematically
compare SAIL's properties, including scalability, cross-modal information flow
patterns, and visual representation capabilities, with those of modular MLLMs.
By scaling both training data and model size, SAIL achieves performance
comparable to modular MLLMs. Notably, the removal of pretrained ViT components
enhances SAIL's scalability and results in significantly different cross-modal
information flow patterns. Moreover, SAIL demonstrates strong visual
representation capabilities, achieving results on par with ViT-22B in vision
tasks such as semantic segmentation. Code and models are available at
https://github.com/bytedance/SAIL.
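The abstract states that SAIL replaces purely causal attention with a mix-attention mechanism to better match the characteristics of visual and textual tokens. The sketch below shows one common way such a mask can be built (bidirectional attention among image-patch tokens, causal attention for text); the function name build_mix_attention_mask and the exact masking rule are illustrative assumptions for a prefix-LM-style setup, not taken from the paper or the released code.

```python
import torch

def build_mix_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    # is_image: bool tensor of shape (seq_len,); True marks image-patch tokens.
    # Returns a (seq_len, seq_len) bool mask where True means "may attend".
    seq_len = is_image.shape[0]
    # Text tokens follow the usual causal rule: attend only to earlier positions.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Image tokens additionally attend to every other image token (bidirectional block).
    image_block = is_image.unsqueeze(1) & is_image.unsqueeze(0)
    return causal | image_block

# Example: 4 image-patch tokens followed by 3 text tokens.
is_image = torch.tensor([True, True, True, True, False, False, False])
mask = build_mix_attention_mask(is_image)

# The boolean mask can be passed directly to PyTorch's fused attention kernel.
q = k = v = torch.randn(1, 1, 7, 8)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Under this assumed scheme, text generation remains autoregressive while the image patches form a fully connected block, which is one plausible reading of why a single transformer without a separate ViT can still learn strong visual representations.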