Scaling Language-Free Visual Representation Learning
April 1, 2025
Authors: David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, Saining Xie
cs.AI
Abstract
Visual Self-Supervised Learning (SSL) currently underperforms Contrastive
Language-Image Pretraining (CLIP) in multimodal settings such as Visual
Question Answering (VQA). This multimodal gap is often attributed to the
semantics introduced by language supervision, even though visual SSL and CLIP
models are often trained on different data. In this work, we ask the question:
"Do visual self-supervised approaches lag behind CLIP due to the lack of
language supervision, or differences in the training data?" We study this
question by training both visual SSL and CLIP models on the same MetaCLIP data,
and leveraging VQA as a diverse testbed for vision encoders. In this controlled
setup, visual SSL models scale better than CLIP models in terms of data and
model capacity, and visual SSL performance does not saturate even after scaling
up to 7B parameters. Consequently, we observe visual SSL methods achieve
CLIP-level performance on a wide range of VQA and classic vision benchmarks.
These findings demonstrate that pure visual SSL can match language-supervised
visual pretraining at scale, opening new opportunities for vision-centric
representation learning.
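As a rough illustration of the controlled setup described in the abstract (two vision encoders pretrained on the same MetaCLIP images with different objectives, then compared through an identical downstream VQA evaluation), here is a minimal PyTorch sketch. The stand-in encoders, feature dimensions, and the toy VQA head below are illustrative assumptions, not the authors' models or training code.

```python
# Minimal sketch of the controlled comparison: a visual-SSL encoder and a
# CLIP-style encoder are evaluated with the *same* lightweight VQA head,
# so any performance difference comes from the pretraining objective,
# not from the downstream pipeline. All module names here are toy stand-ins.
import torch
import torch.nn as nn


class FrozenEncoderVQA(nn.Module):
    """Wraps a frozen vision encoder with a lightweight, shared VQA head."""

    def __init__(self, vision_encoder: nn.Module, feat_dim: int,
                 question_dim: int = 512, num_answers: int = 3000):
        super().__init__()
        self.encoder = vision_encoder
        for p in self.encoder.parameters():  # freeze: only the head is trained
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(feat_dim + question_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, num_answers),
        )

    def forward(self, images: torch.Tensor, question_emb: torch.Tensor):
        with torch.no_grad():
            feats = self.encoder(images)  # (B, feat_dim) image features
        return self.head(torch.cat([feats, question_emb], dim=-1))


# Toy stand-in encoders; in practice these would be an SSL-pretrained ViT and
# a CLIP vision tower, both trained on the same MetaCLIP images so that only
# the supervision signal (visual-only vs. language-aligned) differs.
ssl_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
clip_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))

images = torch.randn(4, 3, 224, 224)      # dummy image batch
q_embeds = torch.randn(4, 512)             # dummy question embeddings

for name, enc in [("visual-SSL", ssl_encoder), ("CLIP", clip_encoder)]:
    logits = FrozenEncoderVQA(enc, feat_dim=768)(images, q_embeds)
    print(name, logits.shape)  # both paths share an identical evaluation head
```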