VideoAuteur: Towards Long Narrative Video Generation

Abstract

Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/
