
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

December 12, 2024
Authors: Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang
cs.AI

Abstract

The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. In this work, we posit an overlooked opportunity to optimize the intermediate LLM representations through a vision perspective (objective), i.e., solely natural language supervision is sub-optimal for the MLLM's visual understanding ability. To that end, we propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations. Firstly, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next text-token prediction. Secondly, we investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Moreover, upon probing our OLA-VLM, we observe improved representation quality owing to the embedding optimization. Thirdly, we demonstrate that our OLA-VLM outperforms the single and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM. Particularly, OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. Our code is open-sourced at https://github.com/SHI-Labs/OLA-VLM .
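The coupled objective described in the abstract can be illustrated with a short sketch: a next text-token prediction loss combined with an embedding-distillation term that pulls an intermediate LLM hidden state toward a target visual representation. The sketch below is a minimal, assumption-based illustration in PyTorch, not the authors' implementation; the names `CoupledObjective`, `embed_predictor`, and `lambda_embed` are hypothetical and do not come from the OLA-VLM codebase (see the linked repository for the actual code).

```python
# Minimal sketch (not the authors' implementation) of a coupled objective:
# next text-token prediction plus an embedding-distillation loss that aligns an
# intermediate LLM hidden state with a target visual embedding. All module and
# parameter names here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoupledObjective(nn.Module):
    def __init__(self, hidden_dim: int, target_dim: int, lambda_embed: float = 0.5):
        super().__init__()
        # Small probe head mapping an intermediate LLM hidden state into the space
        # of the target visual representation (e.g., a depth or segmentation encoder).
        self.embed_predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, target_dim),
        )
        self.lambda_embed = lambda_embed

    def forward(self, logits, text_labels, intermediate_hidden, target_visual_embed):
        # Standard next text-token prediction (language modeling) loss.
        lm_loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            text_labels.view(-1),
            ignore_index=-100,
        )
        # Embedding-distillation loss: predicted visual embedding vs. target features.
        pred = self.embed_predictor(intermediate_hidden)
        embed_loss = 1.0 - F.cosine_similarity(pred, target_visual_embed, dim=-1).mean()
        return lm_loss + self.lambda_embed * embed_loss
```

In this sketch the distillation term is weighted by `lambda_embed` and added to the language-modeling loss during pretraining, mirroring the paper's description of coupling predictive visual embedding with next text-token prediction; the choice of cosine distance and of which LLM layer supplies `intermediate_hidden` are assumptions for illustration only.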
