Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

October 28, 2024
Authors: Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue
cs.AI

Abstract

Search engines enable the retrieval of unknown information with texts. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question regarding that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to heavy computational burdens. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs' visual understanding capabilities and web agents' real-time information access to perform open-world Retrieval-Augmented Generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments conducted on both open-set and closed-set QA benchmarks demonstrate that the Vision Search Assistant significantly outperforms the other models and can be widely applied to existing VLMs.
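
The abstract describes a pipeline in which the VLM's visual understanding seeds a web agent's retrieval, and the retrieved text is then fed back to the VLM for open-world retrieval-augmented generation. Below is a minimal illustrative sketch of that collaboration, written under stated assumptions: every name in it (vlm.describe, vlm.answer, web_search, WebResult) is a hypothetical placeholder, not the authors' actual interface or implementation.

```python
# Illustrative sketch only: the VLM/web-agent interfaces used here are
# hypothetical placeholders, not the Vision Search Assistant codebase.

from dataclasses import dataclass
from typing import List


@dataclass
class WebResult:
    title: str
    snippet: str


def web_search(query: str, top_k: int = 5) -> List[WebResult]:
    """Placeholder web agent: fetch top-k textual results for a query."""
    raise NotImplementedError("Wire this to a real search backend.")


def vision_search_assistant(vlm, image, question: str) -> str:
    """Answer a question about a possibly unseen image using web retrieval.

    1. The VLM produces a textual description of the visual content.
    2. The description and the user's question seed a web search.
    3. Retrieved snippets are fed back to the VLM so the answer can draw on
       fresh external knowledge rather than only the model's frozen weights.
    """
    # Step 1: ground the image in text using the VLM's visual understanding.
    visual_description = vlm.describe(image)

    # Step 2: let the web agent retrieve real-time information about it.
    results = web_search(f"{visual_description} {question}")
    context = "\n".join(f"- {r.title}: {r.snippet}" for r in results)

    # Step 3: retrieval-augmented generation conditioned on image + web text.
    prompt = (
        f"Question: {question}\n"
        f"Web context:\n{context}\n"
        "Answer using the image and the context above."
    )
    return vlm.answer(image, prompt)
```

The point of this structure is that knowledge about novel objects or events comes from the retrieved web text rather than from the VLM's parameters, which is why the framework can sidestep frequent model updates.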
