VideoRAG: Retrieval-Augmented Generation over Video Corpus
January 10, 2025
Authors: Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) is a powerful strategy to address the
issue of generating factually incorrect outputs in foundation models by
retrieving external knowledge relevant to queries and incorporating it into
their generation process. However, existing RAG approaches have primarily
focused on textual information, with some recent advancements beginning to
consider images, but they largely overlook videos, a rich source of multimodal
knowledge capable of representing events, processes, and contextual details
more effectively than any other modality. While a few recent studies explore
the integration of videos in the response generation process, they either
predefine query-associated videos without retrieving them according to queries,
or convert videos into textual descriptions without harnessing their
multimodal richness. To tackle these limitations, we introduce VideoRAG, a
novel framework that not only dynamically retrieves relevant videos based on
their relevance to queries but also utilizes both visual and textual
information of videos in output generation. Further, to operationalize this,
our method revolves around recent advances in Large Video Language Models
(LVLMs), which enable
the direct processing of video content to represent it for retrieval and
seamless integration of the retrieved videos jointly with queries. We
experimentally validate the effectiveness of VideoRAG, showcasing that it is
superior to relevant baselines.
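
The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Video` record, the toy embeddings, and the helpers `cosine`, `retrieve`, and `build_generation_input` are all hypothetical names chosen for illustration, and plain cosine similarity stands in for whatever relevance score the LVLM-based retriever actually uses.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Video:
    """Hypothetical corpus entry; a real system would hold frames, not a toy vector."""
    video_id: str
    embedding: List[float]  # assumed joint visual+text representation
    transcript: str         # assumed textual side information (e.g. subtitles)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding: List[float], corpus: List[Video], k: int = 1) -> List[Video]:
    """Rank videos by similarity to the query and keep the top k."""
    ranked = sorted(corpus, key=lambda v: cosine(query_embedding, v.embedding), reverse=True)
    return ranked[:k]


def build_generation_input(query: str, videos: List[Video]) -> Dict:
    """Bundle the query with the retrieved videos for a downstream generator."""
    return {
        "query": query,
        "videos": [{"id": v.video_id, "transcript": v.transcript} for v in videos],
    }


# Toy corpus with hand-written embeddings (illustrative values only).
corpus = [
    Video("cooking", [0.9, 0.1, 0.0], "Whisk the eggs, then fold in the flour."),
    Video("cycling", [0.0, 0.2, 0.9], "Shift to a lower gear before the climb."),
]

top = retrieve([1.0, 0.0, 0.0], corpus, k=1)
payload = build_generation_input("How do I make a batter?", top)
```

In this sketch the generator would receive `payload`, so the retrieved video's content travels alongside the query rather than being fixed in advance.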