
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

December 12, 2024
Authors: Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang
cs.AI

Abstract

The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. In this work, we posit an overlooked opportunity to optimize the intermediate LLM representations through a vision perspective (objective), i.e., solely natural language supervision is sub-optimal for the MLLM's visual understanding ability. To that end, we propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations. Firstly, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next text-token prediction. Secondly, we investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Moreover, upon probing our OLA-VLM, we observe improved representation quality owing to the embedding optimization. Thirdly, we demonstrate that our OLA-VLM outperforms the single and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM. Particularly, OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. Our code is open-sourced at https://github.com/SHI-Labs/OLA-VLM.
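To make the coupled objective in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it combines a standard next text-token cross-entropy loss with a distillation term that pulls an intermediate LLM hidden state toward a target visual embedding. The `EmbeddingProbe` module, the choice of `layer_idx`, the mean-pooling, the cosine-distance loss, and the `distill_weight` are all illustrative assumptions; the paper should be consulted for the actual probe design and target encoders.

```python
# Illustrative sketch of a coupled objective: next text-token prediction plus
# distillation of a target visual embedding into an intermediate LLM hidden state.
# Probe architecture, layer index, pooling, and loss weight are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingProbe(nn.Module):
    """Hypothetical head mapping an LLM hidden state to a target visual-embedding space."""

    def __init__(self, llm_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, target_dim),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)


def coupled_loss(
    logits: torch.Tensor,            # (B, T, vocab) LLM output logits
    text_labels: torch.Tensor,       # (B, T) token ids, -100 where masked
    hidden_states: tuple,            # per-layer hidden states, each (B, T, llm_dim)
    probe: EmbeddingProbe,
    target_visual_emb: torch.Tensor, # (B, target_dim) embedding from a target vision encoder
    layer_idx: int = 18,             # assumed intermediate layer to supervise
    distill_weight: float = 0.5,     # assumed weighting of the distillation term
) -> torch.Tensor:
    # Next text-token prediction: shift logits against labels.
    ntp = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        text_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    # Embedding distillation: pool the chosen hidden layer, project it with the
    # probe, and minimize cosine distance to the target visual embedding.
    pooled = hidden_states[layer_idx].mean(dim=1)          # (B, llm_dim)
    pred_emb = probe(pooled)                               # (B, target_dim)
    distill = 1.0 - F.cosine_similarity(pred_emb, target_visual_emb, dim=-1).mean()
    return ntp + distill_weight * distill
```

Note that in this sketch the distillation term is applied only during pretraining, matching the abstract's description of the coupled objective at that stage; the auxiliary probe would be dropped at inference so the deployed model is unchanged.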
