VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
December 2, 2024
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu
cs.AI
Abstract
The recent surge in high-quality visual instruction tuning samples from
closed-source vision-language models (VLMs) such as GPT-4V has accelerated the
release of open-source VLMs across various model sizes. However, scaling VLMs
to improve performance using larger models brings significant computational
challenges, especially for deployment on resource-constrained devices like
mobile platforms and robots. To address this, we propose VLsI: Verbalized
Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which
prioritizes efficiency without compromising accuracy. VLsI leverages a unique,
layer-wise distillation process, introducing intermediate "verbalizers" that
map features from each layer to natural language space, allowing smaller VLMs
to flexibly align with the reasoning processes of larger VLMs. This approach
mitigates the training instability often encountered in output imitation and
goes beyond typical final-layer tuning by aligning the small VLMs' layer-wise
progression with that of the large ones. We validate VLsI across ten
challenging vision-language benchmarks, achieving notable performance gains
(11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling,
merging, or architectural changes.
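The sketch below illustrates the core idea named in the abstract: a "verbalizer" that projects an intermediate layer's hidden states into the vocabulary (natural language) space, so a small VLM's layer-wise outputs can be aligned with those of a large VLM. This is not the authors' released implementation; the module design, the matched-layer pairing, and the KL-based alignment loss are illustrative assumptions based only on the abstract's description.

```python
# Minimal sketch, assuming a PyTorch setup; names, dimensions, and the
# temperature-scaled KL alignment loss are hypothetical choices, not the
# paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Verbalizer(nn.Module):
    """Maps hidden states of one intermediate layer to vocabulary logits."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) -> (batch, seq_len, vocab_size)
        return self.proj(self.norm(hidden))


def layerwise_alignment_loss(student_hiddens, teacher_hiddens,
                             student_verbalizers, teacher_verbalizers,
                             temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between verbalized (vocabulary-space) distributions of
    matched student/teacher layers, averaged over the selected layer pairs."""
    losses = []
    for s_h, t_h, s_v, t_v in zip(student_hiddens, teacher_hiddens,
                                  student_verbalizers, teacher_verbalizers):
        s_logits = s_v(s_h) / temperature          # student layer, verbalized
        with torch.no_grad():
            t_logits = t_v(t_h) / temperature      # teacher layer, verbalized
        losses.append(F.kl_div(F.log_softmax(s_logits, dim=-1),
                               F.softmax(t_logits, dim=-1),
                               reduction="batchmean"))
    return torch.stack(losses).mean()


if __name__ == "__main__":
    # Toy shapes: two student layers aligned with two selected teacher layers.
    batch, seq, d_small, d_large, vocab = 2, 8, 64, 128, 1000
    student_hiddens = [torch.randn(batch, seq, d_small) for _ in range(2)]
    teacher_hiddens = [torch.randn(batch, seq, d_large) for _ in range(2)]
    student_verbalizers = [Verbalizer(d_small, vocab) for _ in range(2)]
    teacher_verbalizers = [Verbalizer(d_large, vocab) for _ in range(2)]
    loss = layerwise_alignment_loss(student_hiddens, teacher_hiddens,
                                    student_verbalizers, teacher_verbalizers)
    print(loss.item())
```

Because both models are compared in the shared vocabulary space rather than in their (differently sized) hidden spaces, no dimension-matching projection between student and teacher features is needed, which is one plausible reading of how the approach avoids architectural changes.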