Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
October 28, 2024
Authors: Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue
cs.AI
Abstract
Search engines enable the retrieval of unknown information through text.
However, traditional methods fall short when it comes to understanding
unfamiliar visual content, such as identifying an object that the model has
never seen before. This challenge is particularly pronounced for large
vision-language models (VLMs): if the model has not been exposed to the object
depicted in an image, it struggles to generate reliable answers to the user's
question regarding that image. Moreover, as new objects and events continuously
emerge, frequently updating VLMs is impractical due to heavy computational
burdens. To address this limitation, we propose Vision Search Assistant, a
novel framework that facilitates collaboration between VLMs and web agents.
This approach leverages VLMs' visual understanding capabilities and web agents'
real-time information access to perform open-world Retrieval-Augmented
Generation via the web. By integrating visual and textual representations
through this collaboration, the model can provide informed responses even when
the image is novel to the system. Extensive experiments conducted on both
open-set and closed-set QA benchmarks demonstrate that the Vision Search
Assistant significantly outperforms other models and can be widely applied
to existing VLMs.
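The abstract describes a pipeline in which the VLM first converts unfamiliar visual content into text, a web agent then retrieves up-to-date information about that content, and the VLM finally answers the user's question with the retrieved context. The sketch below illustrates that flow under stated assumptions; the function names (vlm_describe, web_search, vlm_answer) and prompt formats are illustrative placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of the open-world retrieval-augmented generation loop
# outlined in the abstract. All function names and prompts are
# hypothetical placeholders, not the paper's implementation.

def vlm_describe(image_path: str, question: str) -> str:
    """Hypothetical VLM call: describe the image regions relevant to the
    user's question, turning unfamiliar visual content into text."""
    return f"description of objects in {image_path} relevant to: {question}"

def web_search(query: str) -> list[str]:
    """Hypothetical web-agent call: retrieve real-time passages for the
    query derived from the visual description."""
    return [f"retrieved passage about '{query}'"]

def vlm_answer(image_path: str, question: str, context: list[str]) -> str:
    """Hypothetical VLM call: answer using both the image and the
    retrieved textual context."""
    return f"answer to '{question}' grounded in {len(context)} web passages"

def vision_search_assistant(image_path: str, question: str) -> str:
    # 1. Visual understanding: express the unfamiliar content as text.
    description = vlm_describe(image_path, question)
    # 2. Web agent: fetch up-to-date information about that content.
    context = web_search(description)
    # 3. Retrieval-augmented generation: answer from image + retrieved text.
    return vlm_answer(image_path, question, context)

if __name__ == "__main__":
    print(vision_search_assistant("novel_object.jpg", "What is this device?"))
```

Because the retrieval step supplies current web information at inference time, the VLM itself does not need to be retrained as new objects and events appear.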