Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Abstract

Search engines let users retrieve unknown information through text queries. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if a model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to a user's question about that image. Moreover, because new objects and events emerge continuously, frequently updating VLMs is impractical given the heavy computational cost. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. The approach leverages the visual understanding capabilities of VLMs and the real-time information access of web agents to perform open-world Retrieval-Augmented Generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments on both open-set and closed-set QA benchmarks demonstrate that Vision Search Assistant significantly outperforms the other models and can be widely applied to existing VLMs.
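The collaboration described above can be pictured with a minimal sketch. It is an illustration only, not the authors' implementation: the three-step split (describe, search, generate), the function names, the `Snippet` type, and the stub components are all hypothetical, and the abstract does not specify how the real pipeline is organized.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Snippet:
    """One piece of text returned by the (hypothetical) web agent."""
    url: str
    text: str


def answer_with_web_search(
    image: bytes,
    question: str,
    describe_image: Callable[[bytes, str], str],
    search_web: Callable[[str], List[Snippet]],
    generate_answer: Callable[[bytes, str, str], str],
    max_snippets: int = 3,
) -> str:
    """Assumed three-step loop: VLM description -> web retrieval -> grounded answer."""
    # Step 1 (assumption): the VLM turns the image and question into a text query.
    query = describe_image(image, question)
    # Step 2 (assumption): the web agent retrieves up-to-date text for that query.
    snippets = search_web(query)[:max_snippets]
    # Step 3 (assumption): the VLM answers with the retrieved text as extra context.
    context = "\n".join(f"[{s.url}] {s.text}" for s in snippets)
    return generate_answer(image, question, context)


# Stub components so the sketch runs without any model or network access.
def stub_describe(image: bytes, question: str) -> str:
    return f"unidentified object; {question}"


def stub_search(query: str) -> List[Snippet]:
    return [Snippet(f"https://example.com/{i}", f"note {i} for: {query}") for i in range(4)]


def stub_generate(image: bytes, question: str, context: str) -> str:
    return f"Answer to '{question}' using:\n{context}"


if __name__ == "__main__":
    print(answer_with_web_search(b"", "What is this?", stub_describe, stub_search, stub_generate))
```

Passing the three components in as callables keeps the sketch independent of any particular VLM or search backend, which loosely mirrors the claim that the framework can be applied to existing VLMs.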
