Vidi: Large Multimodal Models for Video Understanding and Editing
April 22, 2025
Authors: Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu
cs.AI
Abstract
Humans naturally share information with those they are connected to, and
video has become one of the dominant mediums for communication and expression
on the Internet. To support the creation of high-quality large-scale video
content, a modern pipeline requires a comprehensive understanding of both the
raw input materials (e.g., the unedited footage captured by cameras) and the
editing components (e.g., visual effects). In video editing scenarios, models
must process multiple modalities (e.g., vision, audio, text) with strong
background knowledge and handle flexible input lengths (e.g., hour-long raw
videos), which poses significant challenges for traditional models. In this
report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a
wide range of video understanding and editing scenarios. The first release focuses on
temporal retrieval, i.e., identifying the time ranges within the input videos
corresponding to a given text query, which plays a critical role in intelligent
editing. The model is capable of processing hour-long videos with strong
temporal understanding capability, e.g., retrieving time ranges for specific
queries. To support a comprehensive evaluation in real-world scenarios, we also
present the VUE-TR benchmark, which introduces five key advancements: 1) Video
duration: significantly longer than in existing temporal retrieval datasets; 2)
Audio support: includes audio-based queries; 3) Query format: diverse query
lengths and formats; 4) Annotation quality: ground-truth time ranges are manually
annotated; 5) Evaluation metric: a refined IoU metric to support evaluation
over multiple time ranges. Remarkably, Vidi significantly outperforms leading
proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task,
indicating its superiority in video editing scenarios.
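The abstract does not define the refined IoU metric in detail. One plausible reading, sketched below as an assumption rather than the paper's actual formulation, is a union-based IoU between two *sets* of time ranges: overlapping intervals on each side are first merged, then the total intersected duration is divided by the total union duration. The function names (`merge_intervals`, `interval_iou`) are illustrative, not from the paper.

```python
def merge_intervals(intervals):
    """Merge overlapping [start, end] time ranges into disjoint ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def interval_iou(pred, gt):
    """IoU between predicted and ground-truth sets of time ranges (in seconds)."""
    pred, gt = merge_intervals(pred), merge_intervals(gt)
    # Pairwise overlap is exact here because each set is disjoint after merging.
    inter = sum(max(0.0, min(pe, ge) - max(ps, gs))
                for ps, pe in pred for gs, ge in gt)
    total = sum(e - s for s, e in pred) + sum(e - s for s, e in gt)
    union = total - inter
    return inter / union if union > 0 else 0.0
```

For example, a single predicted range [0, 10] against a ground truth of [5, 15] yields an overlap of 5 s over a union of 15 s, i.e., an IoU of 1/3; the same function handles queries whose ground truth spans multiple disjoint ranges.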