VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

November 22, 2024
Authors: Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal
cs.AI

Abstract

Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that are misaligned with their text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of four stages. In (1) video evaluation, we detect misalignments by generating fine-grained evaluation questions and answering them with a multimodal LLM (MLLM). In (2) refinement planning, we identify accurately generated objects and then create localized prompts to refine other areas in the video. Next, in (3) region decomposition, we segment the correctly generated area using a combined grounding module. Finally, in (4) localized refinement, we regenerate the video by adjusting the misaligned regions while preserving the correct regions. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.
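
The four stages can be read as a simple data flow. The sketch below is a toy illustration of that flow only, not the authors' implementation: every name in it (`evaluate`, `plan`, `decompose`, `refine`, `answer_fn`, `locate_fn`) is hypothetical, the MLLM and the grounding module are replaced by caller-supplied stub functions, a video is reduced to a single frame stored as a nested list, and stage 4 is approximated by mask-based compositing because the abstract does not specify how the T2V model performs the localized regeneration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


@dataclass
class QAResult:
    question: str
    target: str   # object or attribute the question checks
    passed: bool  # True if the (stubbed) MLLM judged it present


def evaluate(items: List[str], answer_fn: Callable[[str], bool]) -> List[QAResult]:
    """Stage 1 (video evaluation): one yes/no question per prompt item."""
    return [QAResult(f"Is there {it} in the video?", it, answer_fn(it)) for it in items]


def plan(results: List[QAResult]) -> Tuple[List[str], str]:
    """Stage 2 (refinement planning): items to keep, plus a localized prompt."""
    keep = [r.target for r in results if r.passed]
    redo = [r.target for r in results if not r.passed]
    return keep, ", ".join(redo)


def decompose(keep: List[str], locate_fn: Callable[[str], Box],
              width: int, height: int) -> List[List[int]]:
    """Stage 3 (region decomposition): binary mask, 1 = correctly generated."""
    mask = [[0] * width for _ in range(height)]
    for item in keep:
        x0, y0, x1, y1 = locate_fn(item)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 1
    return mask


def refine(original, regenerated, mask):
    """Stage 4 (localized refinement), approximated here by compositing:
    keep masked pixels from the original, take the rest from the new sample."""
    return [[o if m else r for o, r, m in zip(o_row, r_row, m_row)]
            for o_row, r_row, m_row in zip(original, regenerated, mask)]


# Toy run: the stub "MLLM" finds the cat but not the dog.
items = ["a cat", "a dog"]
results = evaluate(items, answer_fn=lambda it: it == "a cat")
keep, local_prompt = plan(results)
mask = decompose(keep, locate_fn=lambda it: (0, 0, 2, 2), width=4, height=2)
original = [["o"] * 4 for _ in range(2)]
regenerated = [["r"] * 4 for _ in range(2)]  # stands in for a video generated from local_prompt
out = refine(original, regenerated, mask)
```

In this toy run the left half of the frame (the region located for "a cat") is preserved from the original, while the right half is taken from the frame regenerated for the localized prompt "a dog".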
