VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

Abstract

Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that are misaligned with their text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of four stages. In (1) video evaluation, we detect misalignments by generating fine-grained evaluation questions and answering them with a multimodal large language model (MLLM). In (2) refinement planning, we identify accurately generated objects and then create localized prompts to refine the other areas of the video. In (3) region decomposition, we segment the correctly generated area using a combined grounding module. In (4) localized refinement, we regenerate the video by adjusting the misaligned regions while preserving the correct ones. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We also provide a comprehensive analysis of VideoRepair's components, along with qualitative examples.
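
The four stages described in the abstract can be illustrated with a toy control-flow sketch. It is not the authors' implementation. Every name here (`Question`, `Plan`, `evaluate`, `plan`, `blend`) is hypothetical. The MLLM is replaced by a plain callable that answers yes/no, the grounding module by a hand-written boolean mask, and the diffusion-based refinement by a simple per-pixel selection between an original and a regenerated frame.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Question:
    """One fine-grained yes/no check tied to an object named in the prompt."""
    obj: str
    text: str


@dataclass
class Plan:
    keep: List[str]      # objects judged correct; their regions are preserved
    refine: List[str]    # objects judged wrong or missing
    local_prompt: str    # prompt applied only to the regions being regenerated


def evaluate(questions: List[Question],
             answer: Callable[[str], bool]) -> Dict[str, bool]:
    """Stage 1 (toy): an object passes only if every question about it is 'yes'."""
    passed: Dict[str, bool] = {}
    for q in questions:
        passed[q.obj] = passed.get(q.obj, True) and answer(q.text)
    return passed


def plan(passed: Dict[str, bool]) -> Plan:
    """Stage 2 (toy): split objects into keep/refine and build a localized prompt."""
    keep = [o for o, ok in passed.items() if ok]
    refine = [o for o, ok in passed.items() if not ok]
    return Plan(keep, refine, ", ".join(refine))


def blend(original: List[List[int]],
          regenerated: List[List[int]],
          keep_mask: List[List[bool]]) -> List[List[int]]:
    """Stages 3-4 (toy stand-in): keep masked pixels, take the rest from the new frame."""
    return [
        [o if m else r for o, r, m in zip(orow, rrow, mrow)]
        for orow, rrow, mrow in zip(original, regenerated, keep_mask)
    ]


# Example: the cat was generated correctly, the dog is missing.
questions = [
    Question("cat", "Is there a cat?"),
    Question("cat", "Is the cat black?"),
    Question("dog", "Is there a dog?"),
]
answers = {"Is there a cat?": True, "Is the cat black?": True, "Is there a dog?": False}
result = plan(evaluate(questions, answers.__getitem__))
frame = blend([[1, 1], [1, 1]], [[9, 9], [9, 9]], [[True, False], [False, True]])
```

In this example `result.keep` contains the cat and `result.refine` contains the dog, so only the region outside the cat's mask would be regenerated. A real system would operate on video latents or frames rather than small integer grids.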
