VideoAuteur: Towards Long Narrative Video Generation

Abstract

Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/
