VideoRAG: Retrieval-Augmented Generation over Video Corpus

Abstract

Retrieval-Augmented Generation (RAG) is a powerful strategy for reducing factually incorrect outputs from foundation models: it retrieves external knowledge relevant to a query and incorporates that knowledge into the generation process. However, existing RAG approaches have focused primarily on textual information, with some recent work beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore integrating videos into response generation, they either predefine the videos associated with a query rather than retrieving them according to the query, or convert videos into textual descriptions without harnessing their multimodal richness. To address these limitations, we introduce VideoRAG, a novel framework that not only dynamically retrieves videos according to their relevance to the query but also uses both the visual and the textual information of those videos in output generation. To operationalize this, our method builds on recent advances in Large Video Language Models (LVLMs), which can process video content directly, both to represent it for retrieval and to integrate the retrieved videos seamlessly with the query during generation. We experimentally validate the effectiveness of VideoRAG, showing that it outperforms relevant baselines.
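The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each video already has an embedding vector (in the paper these representations come from an LVLM; here they are hand-written placeholders), and it assumes cosine similarity as the relevance score, which the abstract does not specify. The names `Video`, `retrieve`, and `build_generation_input` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Video:
    video_id: str
    embedding: List[float]  # placeholder for an LVLM-derived representation (assumption)
    transcript: str         # textual information attached to the video


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_embedding: List[float], corpus: List[Video], k: int = 1) -> List[Video]:
    """Return the k videos whose embeddings are most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda v: cosine(query_embedding, v.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_generation_input(query: str, videos: List[Video]) -> Dict[str, object]:
    """Bundle the query with the retrieved videos for a downstream generator.

    A real system would also pass the video frames; this sketch carries
    only the identifier and transcript of each retrieved video.
    """
    return {
        "query": query,
        "videos": [{"id": v.video_id, "transcript": v.transcript} for v in videos],
    }


# Toy corpus with hand-written embeddings (illustrative values only).
corpus = [
    Video("cooking", [0.9, 0.1, 0.0], "Whisk the eggs, then fold in the flour."),
    Video("repair", [0.1, 0.9, 0.2], "Loosen the bolt before removing the wheel."),
]
query_embedding = [0.8, 0.2, 0.1]  # placeholder for the embedded query

top = retrieve(query_embedding, corpus, k=1)
payload = build_generation_input("How do I make batter?", top)
```

In this sketch the retrieved videos are selected per query rather than fixed in advance, and `payload` is what a generator would receive alongside the query, which mirrors the two properties the abstract emphasizes.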
