PaliGemma 2: A Family of Versatile VLMs for Transfer
December 4, 2024
Authors: Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai
cs.AI
Abstract
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM)
based on the Gemma 2 family of language models. We combine the SigLIP-So400m
vision encoder that was also used by PaliGemma with the whole range of Gemma 2
models, from the 2B one all the way up to the 27B model. We train these models
at three resolutions (224px, 448px, and 896px) in multiple stages to equip them
with broad knowledge for transfer via fine-tuning. The resulting family of base
models covering different model sizes and resolutions allows us to investigate
factors impacting transfer performance (such as learning rate) and to analyze
the interplay between the type of task, model size, and resolution. We further
increase the number and breadth of transfer tasks beyond the scope of PaliGemma,
including different OCR-related tasks such as table structure recognition,
molecular structure recognition, music score recognition, as well as long
fine-grained captioning and radiography report generation, on which PaliGemma 2
obtains state-of-the-art results.
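
To make the size-by-resolution grid described in the abstract concrete, the following minimal sketch enumerates language-model sizes against the three training resolutions and estimates how many image tokens the vision encoder would hand to the language model at each resolution. Two things here are assumptions rather than statements from the abstract: the 14-pixel patch size for SigLIP-So400m, and the intermediate 9B entry (the abstract names only the 2B and 27B endpoints). The full cross product is shown for illustration only; the abstract does not say that every combination was trained.

```python
from itertools import product

# Assumption: SigLIP-So400m uses 14x14-pixel patches (not stated in the abstract).
PATCH_SIZE = 14

# Assumption: the Gemma 2 family includes a 9B model between the 2B and 27B
# endpoints that the abstract names explicitly.
GEMMA2_SIZES = ["2B", "9B", "27B"]

# The three training resolutions listed in the abstract (square images, in pixels).
RESOLUTIONS = [224, 448, 896]


def num_image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Count non-overlapping patches in a square image, one token per patch."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side * side


def family_grid() -> list[tuple[str, int, int]]:
    """Return (language model size, resolution, image tokens) for every pairing."""
    return [
        (size, res, num_image_tokens(res))
        for size, res in product(GEMMA2_SIZES, RESOLUTIONS)
    ]


if __name__ == "__main__":
    for size, res, tokens in family_grid():
        print(f"Gemma 2 {size} @ {res}px: {tokens} image tokens")
```

Under these assumptions, doubling the resolution quadruples the number of image tokens, which is one reason the interplay between resolution, model size, and task type is worth studying separately.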