PaliGemma 2: A Family of Versatile VLMs for Transfer
December 4, 2024
Authors: Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai
cs.AI
Abstract
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM)
based on the Gemma 2 family of language models. We combine the SigLIP-So400m
vision encoder that was also used by PaliGemma with the whole range of Gemma 2
models, from the 2B one all the way up to the 27B model. We train these models
at three resolutions (224px, 448px, and 896px) in multiple stages to equip them
with broad knowledge for transfer via fine-tuning. The resulting family of base
models covering different model sizes and resolutions allows us to investigate
factors impacting transfer performance (such as learning rate) and to analyze
the interplay between the type of task, model size, and resolution. We further
increase the number and breadth of transfer tasks beyond the scope of PaliGemma,
including different OCR-related tasks such as table structure recognition,
molecular structure recognition, music score recognition, as well as long
fine-grained captioning and radiography report generation, on which PaliGemma 2
obtains state-of-the-art results.
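The abstract positions these checkpoints as base models meant to be transferred to downstream tasks via fine-tuning. As a rough illustration of how such a pretrained checkpoint might be loaded and prompted, here is a minimal inference sketch using the Hugging Face transformers library; the checkpoint id `google/paligemma2-3b-pt-224`, the placeholder image URL, and the `caption en` task prefix are assumptions for illustration and are not stated in the abstract.

```python
# Minimal sketch: load an (assumed) PaliGemma 2 base checkpoint and generate a caption.
import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from transformers.image_utils import load_image

model_id = "google/paligemma2-3b-pt-224"  # assumed checkpoint: 3B model, 224px resolution

processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).eval()

image = load_image("https://example.com/sample.jpg")  # placeholder image URL
prompt = "caption en"  # PaliGemma-style task prefix requesting an English caption

# The processor builds the multimodal input (image tokens + text prompt).
inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=False)

# Decode only the newly generated tokens, i.e. the caption itself.
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```

For the transfer setting described in the abstract, the same model and processor objects would typically be fine-tuned on task-specific image–text pairs (e.g., OCR-related or captioning data) rather than used zero-shot; the exact fine-tuning recipe is described in the paper itself, not here.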