Attention Prompting on Image for Large Vision-Language Models
September 25, 2024
Authors: Runpeng Yu, Weihao Yu, Xinchao Wang
cs.AI
Abstract
Compared with Large Language Models (LLMs), Large Vision-Language Models
(LVLMs) can also accept images as input, thus showcasing more interesting
emergent capabilities and demonstrating impressive performance on various
vision-language tasks. Motivated by text prompting in LLMs, visual prompting
has been explored to enhance LVLMs' capabilities of perceiving visual
information. However, previous visual prompting techniques solely process
visual inputs without considering text queries, limiting the models' ability to
follow text instructions to complete tasks. To fill this gap, in this work, we
propose a new prompting technique named Attention Prompting on Image, which
simply overlays a text-query-guided attention heatmap on the original
input image and effectively enhances LVLMs on various tasks. Specifically, we
generate an attention heatmap for the input image dependent on the text query
with an auxiliary model like CLIP. The heatmap is then multiplied element-wise with
the pixel values of the original image to obtain the actual input image for the LVLM.
Extensive experiments on various vision-language benchmarks verify the
effectiveness of our technique. For example, Attention Prompting on Image
improves LLaVA-1.5 by 3.8% and 2.9% on MM-Vet and LLaVA-Wild benchmarks,
respectively.
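The abstract describes two steps: generating a text-query-dependent attention heatmap with an auxiliary model such as CLIP, and multiplying that heatmap with the pixel values of the original image before feeding it to the LVLM. The sketch below illustrates one plausible realization of this pipeline; the patch-level cosine-similarity heatmap, the rescaling of the heatmap into [0.5, 1], and the `attention_prompt` function name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of query-guided attention prompting, assuming a patch-level
# CLIP similarity map stands in for the paper's heatmap generation.
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def attention_prompt(image: Image.Image, query: str) -> Image.Image:
    inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Patch tokens from the vision encoder (drop the CLS token),
        # projected into the shared CLIP embedding space.
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        patches = model.visual_projection(vision_out.last_hidden_state[:, 1:, :])  # (1, N, D)
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )  # (1, D)
    # Cosine similarity between each image patch and the text query.
    sim = F.cosine_similarity(patches, text_emb[:, None, :], dim=-1)  # (1, N)
    n = int(sim.shape[1] ** 0.5)
    heat = sim.reshape(1, 1, n, n)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)  # normalize to [0, 1]
    # Upsample to the original resolution and rescale so low-relevance
    # regions are dimmed rather than blacked out (an assumed design choice).
    heat = F.interpolate(heat, size=image.size[::-1], mode="bilinear", align_corners=False)
    heat = 0.5 + 0.5 * heat
    # Multiply the heatmap with the original pixel values.
    img = torch.tensor(np.array(image), dtype=torch.float32) / 255.0  # (H, W, 3)
    prompted = img * heat[0, 0, :, :, None]
    return Image.fromarray((prompted.clamp(0, 1).numpy() * 255).astype("uint8"))
```

The returned image would then be passed to the LVLM (e.g., LLaVA-1.5) in place of the original image, with the text query unchanged.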