
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

October 28, 2024
Authors: Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue
cs.AI

Abstract

Search engines enable the retrieval of unknown information with texts. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question regarding that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to heavy computational burdens. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs' visual understanding capabilities and web agents' real-time information access to perform open-world Retrieval-Augmented Generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments conducted on both open-set and closed-set QA benchmarks demonstrate that the Vision Search Assistant significantly outperforms the other models and can be widely applied to existing VLMs.
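To make the collaboration described above concrete, the sketch below outlines the pipeline the abstract implies: the VLM turns the (possibly novel) visual content into a textual query, a web agent retrieves up-to-date information, and the VLM answers with that evidence in context. This is a minimal illustration under stated assumptions; the class and method names (VisionLanguageModel, WebAgent, vision_search) are hypothetical and do not correspond to the paper's actual interfaces.

```python
# Illustrative sketch of the VLM / web-agent collaboration loop described in
# the abstract. All names here are assumptions, not the paper's actual API.

from dataclasses import dataclass


@dataclass
class Retrieval:
    """A single piece of web evidence returned by the web agent."""
    title: str
    snippet: str


class VisionLanguageModel:
    """Stand-in for any off-the-shelf VLM used as the visual reasoner."""

    def describe(self, image, question: str) -> str:
        # Produce a textual description of the objects/regions relevant to
        # the question, so the web agent can search for them by text.
        raise NotImplementedError

    def answer(self, image, question: str, evidence: list[Retrieval]) -> str:
        # Generate the final answer conditioned on the image, the question,
        # and the retrieved web evidence (retrieval-augmented generation).
        raise NotImplementedError


class WebAgent:
    """Stand-in for a web search/browsing agent with real-time access."""

    def search(self, query: str, top_k: int = 5) -> list[Retrieval]:
        raise NotImplementedError


def vision_search(image, question: str,
                  vlm: VisionLanguageModel, agent: WebAgent) -> str:
    """Open-world RAG: describe the visual content, retrieve from the web,
    then answer with the retrieved evidence in context."""
    description = vlm.describe(image, question)           # visual -> text
    evidence = agent.search(f"{description} {question}")  # text -> web results
    return vlm.answer(image, question, evidence)          # grounded answer
```

The key design point the abstract emphasizes is that only the textual query and retrieved evidence flow between components, so new objects or events can be handled through retrieval without retraining or updating the VLM itself.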
