SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
February 20, 2025
Authors: Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai
cs.AI
Abstract
We introduce SigLIP 2, a family of new multilingual vision-language encoders
that build on the success of the original SigLIP. In this second iteration, we
extend the original image-text training objective with several prior,
independently developed techniques into a unified recipe -- this includes
captioning-based pretraining, self-supervised losses (self-distillation, masked
prediction) and online data curation. With these changes, SigLIP 2 models
outperform their SigLIP counterparts at all model scales in core capabilities,
including zero-shot classification, image-text retrieval, and transfer
performance when extracting visual representations for Vision-Language Models
(VLMs). Furthermore, the new training recipe leads to significant improvements
on localization and dense prediction tasks. We also train variants which
support multiple resolutions and preserve the input's native aspect ratio.
Finally, we train on a more diverse data-mixture that includes de-biasing
techniques, leading to much better multilingual understanding and improved
fairness. To allow users to trade off inference cost with performance, we
release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M),
and g (1B).
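The "original image-text training objective" that SigLIP 2 extends is the pairwise sigmoid loss introduced by SigLIP: every image-text pair in a batch is treated as an independent binary classification, with matched pairs as positives and all cross-pairs as negatives. A minimal NumPy sketch of that loss is below; the function name is illustrative, and the temperature `t` and bias `b` (learnable scalars in the actual model) are fixed here at their typical initialization-scale values.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of matched image/text embeddings.

    Each (i, j) pair is an independent binary classification: pairs with
    i == j are positives (label +1), all others negatives (label -1).
    """
    # L2-normalize both sets of embeddings
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b        # (N, N) scaled cosine similarities
    n = img.shape[0]
    labels = 2.0 * np.eye(n) - 1.0      # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(label * logit), written as softplus for numerical stability
    loss = np.logaddexp(0.0, -labels * logits)
    return loss.sum() / n               # normalize by batch size

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
# Identical image/text embeddings put positives on the diagonal,
# so the loss is low; anti-aligned embeddings drive it up.
print(siglip_loss(emb, emb))
```

Because every pair is scored independently, this objective avoids the batch-wide softmax normalization of standard contrastive (CLIP-style) training, which is what makes it well behaved across a wide range of batch sizes.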