Vidi: Large Multimodal Models for Video Understanding and Editing
April 22, 2025
Authors: Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu
cs.AI
Abstract
Humans naturally share information with those they are connected to, and
video has become one of the dominant mediums for communication and expression
on the Internet. To support the creation of high-quality large-scale video
content, a modern pipeline requires a comprehensive understanding of both the
raw input materials (e.g., the unedited footage captured by cameras) and the
editing components (e.g., visual effects). In video editing scenarios, models
must process multiple modalities (e.g., vision, audio, text) with strong
background knowledge and handle flexible input lengths (e.g., hour-long raw
videos), which poses significant challenges for traditional models. In this
report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a
wide range of video understanding and editing scenarios. The first release focuses on
temporal retrieval, i.e., identifying the time ranges within the input videos
corresponding to a given text query, which plays a critical role in intelligent
editing. The model is capable of processing hour-long videos with strong
temporal understanding capability, e.g., retrieving time ranges for specific
queries. To support a comprehensive evaluation in real-world scenarios, we also
present the VUE-TR benchmark, which introduces five key advancements: 1) Video
duration: significantly longer than in existing temporal retrieval datasets; 2)
Audio support: includes audio-based queries; 3) Query format: diverse query
lengths and formats; 4) Annotation quality: ground-truth time ranges are manually
annotated; 5) Evaluation metric: a refined IoU metric to support evaluation
over multiple time ranges. Remarkably, Vidi significantly outperforms leading
proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task,
indicating its superiority in video editing scenarios.
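The abstract does not define the refined IoU metric in detail. One plausible reading, sketched below as an assumption rather than the paper's actual formulation, is a union-based IoU between two *sets* of time ranges: overlapping intervals on each side are first merged, then the total intersected duration is divided by the total union duration. The function names (`merge_intervals`, `interval_iou`) are illustrative, not from the paper.

```python
def merge_intervals(intervals):
    """Merge overlapping [start, end] time ranges into disjoint ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def interval_iou(pred, gt):
    """IoU between predicted and ground-truth sets of time ranges (in seconds)."""
    pred, gt = merge_intervals(pred), merge_intervals(gt)
    # Pairwise overlap is exact here because each set is disjoint after merging.
    inter = sum(max(0.0, min(pe, ge) - max(ps, gs))
                for ps, pe in pred for gs, ge in gt)
    total = sum(e - s for s, e in pred) + sum(e - s for s, e in gt)
    union = total - inter
    return inter / union if union > 0 else 0.0
```

For example, a single predicted range [0, 10] against a ground truth of [5, 15] yields an overlap of 5 s over a union of 15 s, i.e., an IoU of 1/3; the same function handles queries whose ground truth spans multiple disjoint ranges.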