VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
December 2, 2024
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu
cs.AI
Abstract
The recent surge in high-quality visual instruction tuning samples from
closed-source vision-language models (VLMs) such as GPT-4V has accelerated the
release of open-source VLMs across various model sizes. However, scaling VLMs
to improve performance using larger models brings significant computational
challenges, especially for deployment on resource-constrained devices like
mobile platforms and robots. To address this, we propose VLsI: Verbalized
Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which
prioritizes efficiency without compromising accuracy. VLsI leverages a unique,
layer-wise distillation process, introducing intermediate "verbalizers" that
map features from each layer to natural language space, allowing smaller VLMs
to flexibly align with the reasoning processes of larger VLMs. This approach
mitigates the training instability often encountered in output imitation and
goes beyond typical final-layer tuning by aligning the small VLMs' layer-wise
progression with that of the large ones. We validate VLsI across ten
challenging vision-language benchmarks, achieving notable performance gains
(11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling,
merging, or architectural changes.
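The sketch below illustrates the core idea named in the abstract: a "verbalizer" that projects an intermediate layer's hidden states into the vocabulary (natural language) space, so a small VLM's layer-wise outputs can be aligned with those of a large VLM. This is not the authors' released implementation; the module design, the matched-layer pairing, and the KL-based alignment loss are illustrative assumptions based only on the abstract's description.

```python
# Minimal sketch, assuming a PyTorch setup; names, dimensions, and the
# temperature-scaled KL alignment loss are hypothetical choices, not the
# paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Verbalizer(nn.Module):
    """Maps hidden states of one intermediate layer to vocabulary logits."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) -> (batch, seq_len, vocab_size)
        return self.proj(self.norm(hidden))


def layerwise_alignment_loss(student_hiddens, teacher_hiddens,
                             student_verbalizers, teacher_verbalizers,
                             temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between verbalized (vocabulary-space) distributions of
    matched student/teacher layers, averaged over the selected layer pairs."""
    losses = []
    for s_h, t_h, s_v, t_v in zip(student_hiddens, teacher_hiddens,
                                  student_verbalizers, teacher_verbalizers):
        s_logits = s_v(s_h) / temperature          # student layer, verbalized
        with torch.no_grad():
            t_logits = t_v(t_h) / temperature      # teacher layer, verbalized
        losses.append(F.kl_div(F.log_softmax(s_logits, dim=-1),
                               F.softmax(t_logits, dim=-1),
                               reduction="batchmean"))
    return torch.stack(losses).mean()


if __name__ == "__main__":
    # Toy shapes: two student layers aligned with two selected teacher layers.
    batch, seq, d_small, d_large, vocab = 2, 8, 64, 128, 1000
    student_hiddens = [torch.randn(batch, seq, d_small) for _ in range(2)]
    teacher_hiddens = [torch.randn(batch, seq, d_large) for _ in range(2)]
    student_verbalizers = [Verbalizer(d_small, vocab) for _ in range(2)]
    teacher_verbalizers = [Verbalizer(d_large, vocab) for _ in range(2)]
    loss = layerwise_alignment_loss(student_hiddens, teacher_hiddens,
                                    student_verbalizers, teacher_verbalizers)
    print(loss.item())
```

Because both models are compared in the shared vocabulary space rather than in their (differently sized) hidden spaces, no dimension-matching projection between student and teacher features is needed, which is one plausible reading of how the approach avoids architectural changes.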