Vidi: Large Multimodal Models for Video Understanding and Editing
April 22, 2025
作者: Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu
cs.AI
Abstract
Humans naturally share information with those they are connected to, and
video has become one of the dominant mediums for communication and expression
on the Internet. To support the creation of high-quality large-scale video
content, a modern pipeline requires a comprehensive understanding of both the
raw input materials (e.g., the unedited footage captured by cameras) and the
editing components (e.g., visual effects). In video editing scenarios, models
must process multiple modalities (e.g., vision, audio, text) with strong
background knowledge and handle flexible input lengths (e.g., hour-long raw
videos), which poses significant challenges for traditional models. In this
report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a
wide range of video understanding and editing scenarios. The first release focuses on
temporal retrieval, i.e., identifying the time ranges within the input videos
corresponding to a given text query, which plays a critical role in intelligent
editing. The model is capable of processing hour-long videos with strong
temporal understanding capability, e.g., retrieving time ranges for certain
queries. To support a comprehensive evaluation in real-world scenarios, we also
present the VUE-TR benchmark, which introduces five key advancements: 1) Video
duration: significantly longer than in existing temporal retrieval datasets;
2) Audio support: includes audio-based queries; 3) Query format: diverse query
lengths and formats; 4) Annotation quality: ground-truth time ranges are
manually annotated; 5) Evaluation metric: a refined IoU metric to support
evaluation over multiple time ranges. Remarkably, Vidi significantly
outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the
temporal retrieval task, indicating its superiority in video editing scenarios.
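The abstract names a refined IoU metric over multiple time ranges without defining it. As a rough illustration, the sketch below computes IoU between two sets of predicted and ground-truth time ranges by merging each set into disjoint intervals and comparing the lengths of their intersection and union. This is an assumed formulation for illustration only, not necessarily the exact metric used in VUE-TR.

```python
# Minimal sketch of an IoU metric over multiple time ranges (in seconds).
# Assumption: IoU is computed between the unions of the predicted and
# ground-truth interval sets; the paper's exact definition may differ.

def merge(ranges):
    """Merge overlapping [start, end] intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def total_length(ranges):
    return sum(end - start for start, end in ranges)

def intersection_length(a, b):
    """Total overlap between two disjoint, sorted interval lists."""
    i = j = 0
    overlap = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        overlap += max(0.0, hi - lo)
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap

def multi_range_iou(pred, gt):
    pred, gt = merge(pred), merge(gt)
    inter = intersection_length(pred, gt)
    union = total_length(pred) + total_length(gt) - inter
    return inter / union if union > 0 else 0.0

# Example: two predicted ranges vs. two ground-truth ranges -> 33 / 50 = 0.66
print(multi_range_iou([[10, 25], [60, 90]], [[12, 30], [65, 85]]))
```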