ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

October 23, 2024
Authors: Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Yonggang Wen
cs.AI

Abstract

Recent advancements in multimodal fusion have witnessed the remarkable success of vision-language (VL) models, which excel in various multimodal applications such as image captioning and visual question answering. However, building VL models requires substantial hardware resources, where efficiency is restricted by two key factors: the extended input sequence of the language model with vision features demands more computational operations, and the large number of additional learnable parameters increases memory complexity. These challenges significantly restrict the broader applicability of such models. To bridge this gap, we propose ADEM-VL, an efficient vision-language method that tunes VL models based on pretrained large language models (LLMs) by adopting a parameter-free cross-attention mechanism for similarity measurements in multimodal fusion. This approach only requires embedding vision features into the language space, significantly reducing the number of trainable parameters and accelerating both training and inference. To enhance representation learning in the fusion module, we introduce an efficient multiscale feature generation scheme that requires only a single forward pass through the vision encoder. Moreover, we propose an adaptive fusion scheme that dynamically discards less relevant visual information for each text token based on its attention score. This ensures that the fusion process prioritizes the most pertinent visual features. With experiments on various tasks including visual question answering, image captioning, and instruction following, we demonstrate that our framework outperforms existing approaches. Specifically, our method surpasses existing methods by an average accuracy of 0.77% on the ScienceQA dataset, with reduced training and inference latency, demonstrating the superiority of our framework. The code is available at https://github.com/Hao840/ADEM-VL.
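
The abstract outlines three mechanisms: a parameter-free cross-attention that measures similarity directly between the language model's hidden states and vision features embedded in the language space, a multiscale feature generation scheme computed from a single vision-encoder pass, and an adaptive fusion step that drops less relevant visual tokens for each text token according to attention scores. The PyTorch sketch below is one plausible reading of that description, not the authors' implementation; the function names, the `keep_ratio` parameter, the pooling scales, and the residual form of the fusion are assumptions.

```python
import torch
import torch.nn.functional as F


def multiscale_vision_features(patch_feats, scales=(1, 2, 4)):
    """Sketch of multiscale feature generation from one vision-encoder pass.

    patch_feats: (B, N, D) patch tokens from a single forward pass; the square
    patch grid and the pooling scales are assumptions, not taken from the paper.
    """
    B, N, D = patch_feats.shape
    side = int(N ** 0.5)
    grid = patch_feats.transpose(1, 2).reshape(B, D, side, side)
    feats = []
    for s in scales:
        pooled = F.adaptive_avg_pool2d(grid, s)          # (B, D, s, s)
        feats.append(pooled.flatten(2).transpose(1, 2))  # (B, s*s, D)
    return torch.cat(feats, dim=1)                       # multiscale visual tokens


def parameter_free_fusion(text_hidden, vision_embed, keep_ratio=0.5):
    """Sketch of parameter-free cross-attention with adaptive token dropping.

    text_hidden:  (B, T, D) language-model hidden states
    vision_embed: (B, V, D) vision features already embedded in the language space
    keep_ratio:   hypothetical fraction of visual tokens kept per text token
    """
    d = text_hidden.size(-1)
    # Similarity is measured directly between text and vision features,
    # with no learnable query/key/value projections (parameter-free).
    scores = torch.matmul(text_hidden, vision_embed.transpose(1, 2)) / d ** 0.5  # (B, T, V)

    # Adaptive fusion: for each text token, keep only the top-k most relevant
    # visual tokens and mask out the rest before the softmax.
    k = max(1, int(keep_ratio * vision_embed.size(1)))
    topk = scores.topk(k, dim=-1).indices
    mask = torch.full_like(scores, float("-inf")).scatter(-1, topk, 0.0)

    attn = F.softmax(scores + mask, dim=-1)                # (B, T, V)
    return text_hidden + torch.matmul(attn, vision_embed)  # residual fusion
```

In this reading, the only trainable component would be whatever projection embeds the vision features into the language space; the fusion itself adds no parameters, which is consistent with the abstract's claim of fewer trainable parameters and faster training and inference.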
