MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval

Abstract

Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets are limited in scope: they focus primarily on matching descriptive but vague queries against small collections of professionally edited, English-centric videos. To address this gap, we introduce MultiVENT 2.0, a large-scale, multilingual, event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all of these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to adequately address the problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation tasks.
