OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
December 12, 2024
Authors: Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang
cs.AI
Abstract
The standard practice for developing contemporary MLLMs is to feed features
from vision encoder(s) into the LLM and train with natural language
supervision. In this work, we posit an overlooked opportunity to optimize the
intermediate LLM representations through a vision perspective (objective),
i.e., solely natural language supervision is sub-optimal for the MLLM's visual
understanding ability. To that end, we propose OLA-VLM, the first approach
distilling knowledge into the LLM's hidden representations from a set of target
visual representations. Firstly, we formulate the objective during the
pretraining stage in MLLMs as a coupled optimization of predictive visual
embedding and next text-token prediction. Secondly, we investigate MLLMs
trained solely with natural language supervision and identify a positive
correlation between the quality of visual representations within these models
and their downstream performance. Moreover, upon probing our OLA-VLM, we
observe improved representation quality owing to the embedding optimization.
Thirdly, we demonstrate that our OLA-VLM outperforms the single and
multi-encoder baselines, proving our approach's superiority over explicitly
feeding the corresponding features to the LLM. Particularly, OLA-VLM boosts
performance by an average margin of up to 2.5% on various benchmarks, with a
notable improvement of 8.7% on the Depth task in CV-Bench. Our code is
open-sourced at https://github.com/SHI-Labs/OLA-VLM.
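
The coupled objective described in the abstract can be pictured as the standard next text-token cross-entropy plus an embedding-distillation term that pulls selected intermediate LLM hidden states toward target visual representations. The snippet below is a minimal PyTorch-style sketch of that idea, not the released OLA-VLM implementation; the names `EmbedProbe`, `probe_layers`, and `lambda_embed`, the pooling over the sequence, and the cosine-distance choice are illustrative assumptions.

```python
# Conceptual sketch of a coupled "next-token + visual-embedding" objective.
# All module and argument names are illustrative, not from the OLA-VLM repo.
import torch
import torch.nn.functional as F
from torch import nn


class EmbedProbe(nn.Module):
    """Projects an LLM hidden state into a target visual-embedding space."""

    def __init__(self, llm_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Linear(llm_dim, target_dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)


def coupled_loss(logits, labels, hidden_states, target_embeds, probes,
                 probe_layers=(8, 16, 24), lambda_embed=0.5):
    """Next text-token cross-entropy plus an embedding loss that distills
    target visual embeddings (e.g., from depth/segmentation encoders) into
    chosen intermediate LLM hidden states."""
    # Standard next-token prediction loss (shift logits against labels).
    ce = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # Embedding loss: cosine distance between projected (sequence-pooled)
    # hidden states at selected layers and the target visual embeddings.
    embed_loss = 0.0
    for layer, probe in zip(probe_layers, probes):
        pred = probe(hidden_states[layer].mean(dim=1))  # (B, target_dim)
        embed_loss = embed_loss + (
            1 - F.cosine_similarity(pred, target_embeds, dim=-1)
        ).mean()
    embed_loss = embed_loss / len(probe_layers)

    return ce + lambda_embed * embed_loss
```

In this sketch, `hidden_states` is the per-layer tuple returned by a transformer run with hidden-state outputs enabled, and `target_embeds` are pooled features from a frozen target vision encoder; the weighting between the two terms is a free hyperparameter.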