VideoAuteur: Towards Long Narrative Video Generation

Abstract

Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/
