VideoRAG: Retrieval-Augmented Generation over Video Corpus
January 10, 2025
Authors: Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) is a powerful strategy to address the
issue of generating factually incorrect outputs in foundation models by
retrieving external knowledge relevant to queries and incorporating it into
their generation process. However, existing RAG approaches have primarily
focused on textual information, with some recent advancements beginning to
consider images, but they largely overlook videos, a rich source of multimodal
knowledge capable of representing events, processes, and contextual details
more effectively than any other modality. While a few recent studies explore
the integration of videos in the response generation process, they either
predefine query-associated videos without retrieving them according to queries,
or convert videos into textual descriptions without harnessing their
multimodal richness. To address these limitations, we introduce VideoRAG, a
novel framework that not only dynamically retrieves videos based on their
relevance to queries but also utilizes both visual and textual information of
videos in the output generation. Further, to operationalize this, our method
revolves around recent advances in Large Video Language Models (LVLMs), which enable
the direct processing of video content to represent it for retrieval and
seamless integration of the retrieved videos jointly with queries. We
experimentally validate the effectiveness of VideoRAG, showcasing that it is
superior to relevant baselines.
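
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the query and each video have already been mapped to fixed-size embedding vectors and that relevance is scored by cosine similarity; the abstract does not specify either detail. The names `cosine`, `retrieve_top_k`, `TOY_VIDEOS`, and `TOY_QUERY` are hypothetical, and the toy vectors are made-up numbers standing in for LVLM-derived representations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, video_vecs, k=2):
    """Return the ids of the k videos whose embeddings score highest against the query.

    Assumption: video_vecs maps a video id to a precomputed embedding.
    In a real system these would come from a video-language encoder.
    """
    ranked = sorted(
        video_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [video_id for video_id, _ in ranked[:k]]

# Toy, hand-written embeddings (hypothetical values, not model outputs).
TOY_VIDEOS = {
    "knot_tutorial": [0.9, 0.1, 0.0],
    "pasta_recipe": [0.1, 0.9, 0.1],
    "bike_repair": [0.0, 0.2, 0.9],
}
TOY_QUERY = [0.8, 0.2, 0.1]

if __name__ == "__main__":
    # The retrieved ids would then be passed, together with the query,
    # to the generation model; that step is not sketched here.
    print(retrieve_top_k(TOY_QUERY, TOY_VIDEOS, k=2))
```

In this sketch the retrieved videos are only identified by id; how the frames and any accompanying text are fed to the generator alongside the query is left out, since the abstract gives no interface details.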