

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

December 2, 2024
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu
cs.AI

Abstract

The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs' layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
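To make the layer-wise distillation idea concrete, below is a minimal sketch in PyTorch of how an intermediate "verbalizer" and a layer-matched distillation objective could be wired up. This is not the authors' implementation: the `Verbalizer` module, the `layerwise_distillation_loss` function, the temperature, and the KL-divergence objective are illustrative assumptions about one plausible way to align a small VLM's per-layer, vocabulary-space outputs with those of a larger teacher.

```python
# Illustrative sketch only (not the paper's code): a hypothetical "verbalizer"
# head that projects a layer's hidden states into vocabulary space, and a
# layer-wise distillation loss that aligns the verbalized distributions of a
# small (student) VLM with those of a large (teacher) VLM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Verbalizer(nn.Module):
    """Maps one layer's hidden states to logits over the shared tokenizer vocabulary."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_dim) -> (batch, seq_len, vocab_size)
        return self.proj(self.norm(hidden_states))


def layerwise_distillation_loss(student_hiddens, teacher_hiddens,
                                student_verbalizers, teacher_verbalizers,
                                temperature: float = 2.0) -> torch.Tensor:
    """Average KL divergence between verbalized (vocabulary-space) distributions
    across a set of matched student/teacher layers."""
    losses = []
    for s_h, t_h, s_vb, t_vb in zip(student_hiddens, teacher_hiddens,
                                    student_verbalizers, teacher_verbalizers):
        s_logits = s_vb(s_h) / temperature
        with torch.no_grad():          # the teacher only supplies soft targets
            t_probs = F.softmax(t_vb(t_h) / temperature, dim=-1)
        losses.append(
            F.kl_div(F.log_softmax(s_logits, dim=-1), t_probs,
                     reduction="batchmean") * temperature ** 2
        )
    return torch.stack(losses).mean()
```

In training, a term like this would presumably be added to the standard next-token cross-entropy objective; the scheme for matching student layers to teacher layers is left abstract here because the abstract does not specify it.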

