OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

Abstract

The standard practice for developing contemporary MLLMs is to feed features from one or more vision encoders into the LLM and train with natural language supervision. In this work, we posit that there is an overlooked opportunity to optimize the intermediate LLM representations from a vision perspective (objective); that is, natural language supervision alone is sub-optimal for the MLLM's visual understanding ability. To that end, we propose OLA-VLM, the first approach that distills knowledge into the LLM's hidden representations from a set of target visual representations. First, we formulate the objective during the pretraining stage of MLLMs as a coupled optimization of predictive visual embedding and next text-token prediction. Second, we investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of the visual representations within these models and their downstream performance. Moreover, upon probing our OLA-VLM, we observe improved representation quality owing to the embedding optimization. Third, we demonstrate that OLA-VLM outperforms the single-encoder and multi-encoder baselines, showing that our approach is superior to explicitly feeding the corresponding features to the LLM. In particular, OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. Our code is open-sourced at https://github.com/SHI-Labs/OLA-VLM.
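
The coupled objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything below is an assumption made for illustration: the function names, the use of cosine distance as the embedding term, and the weighting factor `embed_weight` are hypothetical and are not taken from the OLA-VLM code. The sketch only shows the general shape of adding an embedding-prediction term to the usual next-token cross-entropy.

```python
import math


def cross_entropy(logits, target_index):
    """Next-token cross-entropy for one position, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def cosine_distance(pred, target):
    """1 - cosine similarity between a predicted and a target embedding.

    Assumption: cosine distance stands in for whatever embedding loss
    the paper actually uses.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    # Guard against division by zero for all-zero vectors.
    return 1.0 - dot / max(norm_p * norm_t, 1e-12)


def coupled_loss(logits, target_index, pred_embedding, target_embedding,
                 embed_weight=0.5):
    """Hypothetical coupled objective: text loss plus a weighted embedding loss.

    pred_embedding   -- embedding predicted from an intermediate LLM hidden state
    target_embedding -- feature from a target visual encoder
    embed_weight     -- illustrative weighting factor (an assumption)
    """
    text_loss = cross_entropy(logits, target_index)
    embed_loss = cosine_distance(pred_embedding, target_embedding)
    return text_loss + embed_weight * embed_loss


if __name__ == "__main__":
    loss = coupled_loss(
        logits=[2.0, 0.5, -1.0],
        target_index=0,
        pred_embedding=[0.9, 0.1, 0.0],
        target_embedding=[1.0, 0.0, 0.0],
    )
    print(f"coupled loss: {loss:.4f}")
```

In this sketch the embedding term vanishes when the predicted embedding points in the same direction as the target visual feature, so the total reduces to the plain next-token loss. A real training setup would compute both terms over batches of tensors in a deep learning framework.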
