MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines

Abstract

The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting both multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a carefully constructed pipeline, MMSearch-Engine, to equip any LMM with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark for assessing the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields and has no overlap with the training data of current LMMs, ensuring that the correct answer can only be obtained by searching. Using MMSearch-Engine, the LMMs are evaluated on three individual tasks (requery, rerank, and summarization) and on one challenging end-to-end task with a complete search process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, surpassing the commercial product Perplexity Pro in the end-to-end task and demonstrating the effectiveness of our proposed pipeline. We further present an error analysis revealing that current LMMs still struggle to fully grasp multimodal search tasks, and conduct an ablation study indicating the potential of scaling test-time computation for AI search engines. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engines. Project Page: https://mmsearch.github.io
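To make the three individual tasks concrete, the following is a minimal sketch of how requery, rerank, and summarization could be chained into an end-to-end search round. It is not the MMSearch-Engine implementation: the `Page` type, the `run_pipeline` function, and all four callables are hypothetical names introduced here for illustration, and the sketch assumes that reranking selects a single page to summarize, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Page:
    """Hypothetical container for one retrieved website."""
    title: str
    snippet: str
    screenshot: Optional[str] = None  # path to a page screenshot, if any


def run_pipeline(
    query: str,
    image: Optional[str],
    requery: Callable[[str, Optional[str]], str],
    search: Callable[[str], List[Page]],
    rerank: Callable[[str, List[Page]], int],
    summarize: Callable[[str, Page], str],
) -> str:
    """Chain requery -> search -> rerank -> summarize (illustrative only)."""
    # Requery: rewrite the (possibly image-bearing) user query into a
    # plain text query that an ordinary search backend can handle.
    search_query = requery(query, image)

    # Retrieval is delegated to whatever search backend is supplied.
    pages = search(search_query)
    if not pages:
        return ""

    # Rerank: assumed here to return the index of the most useful page.
    # Clamp the index so an out-of-range model answer cannot crash the run.
    best = max(0, min(rerank(query, pages), len(pages) - 1))

    # Summarization: answer the original query from the selected page.
    return summarize(query, pages[best])
```

In practice each callable would wrap a prompt to the LMM under evaluation, while `search` would wrap a conventional search API. Passing the stages in as functions keeps the sketch independent of any particular model or backend.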
