Vidi: Large Multimodal Models for Video Understanding and Editing
April 22, 2025
作者: Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu
cs.AI
Abstract
Humans naturally share information with those they are connected to, and
video has become one of the dominant mediums for communication and expression
on the Internet. To support the creation of high-quality large-scale video
content, a modern pipeline requires a comprehensive understanding of both the
raw input materials (e.g., the unedited footage captured by cameras) and the
editing components (e.g., visual effects). In video editing scenarios, models
must process multiple modalities (e.g., vision, audio, text) with strong
background knowledge and handle flexible input lengths (e.g., hour-long raw
videos), which poses significant challenges for traditional models. In this
report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a
wide range of video understanding and editing scenarios. The first release focuses on
temporal retrieval, i.e., identifying the time ranges within the input videos
corresponding to a given text query, which plays a critical role in intelligent
editing. The model is capable of processing hour-long videos with strong
temporal understanding capability, e.g., retrieving time ranges for certain
queries. To support a comprehensive evaluation in real-world scenarios, we also
present the VUE-TR benchmark, which introduces five key advancements: 1) Video
duration: significantly longer than in existing temporal retrieval datasets;
2) Audio support: includes audio-based queries; 3) Query format: diverse query
lengths and formats; 4) Annotation quality: ground-truth time ranges are
manually annotated; 5) Evaluation metric: a refined IoU metric to support
evaluation over multiple time ranges. Remarkably, Vidi significantly
outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the
temporal retrieval task, indicating its superiority in video editing scenarios.
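The abstract names a refined IoU metric over multiple time ranges without defining it. As a rough illustration, the sketch below computes IoU between two sets of predicted and ground-truth time ranges by merging each set into disjoint intervals and comparing the lengths of their intersection and union. This is an assumed formulation for illustration only, not necessarily the exact metric used in VUE-TR.

```python
# Minimal sketch of an IoU metric over multiple time ranges (in seconds).
# Assumption: IoU is computed between the unions of the predicted and
# ground-truth interval sets; the paper's exact definition may differ.

def merge(ranges):
    """Merge overlapping [start, end] intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def total_length(ranges):
    return sum(end - start for start, end in ranges)

def intersection_length(a, b):
    """Total overlap between two disjoint, sorted interval lists."""
    i = j = 0
    overlap = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        overlap += max(0.0, hi - lo)
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap

def multi_range_iou(pred, gt):
    pred, gt = merge(pred), merge(gt)
    inter = intersection_length(pred, gt)
    union = total_length(pred) + total_length(gt) - inter
    return inter / union if union > 0 else 0.0

# Example: two predicted ranges vs. two ground-truth ranges -> 33 / 50 = 0.66
print(multi_range_iou([[10, 25], [60, 90]], [[12, 30], [65, 85]]))
```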