
VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

March 7, 2025
作者: Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu
cs.AI

Abstract

Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked-region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context preservation and foreground generation within one model. To address these limitations, we propose a novel dual-stream paradigm, VideoPainter, that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues into any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench, the largest video inpainting dataset and benchmark to date with over 390K diverse clips, to facilitate segmentation-based inpainting training and assessment. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing across eight key metrics, including video quality, mask region preservation, and textual coherence.
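
As a rough illustration of the dual-stream idea described in the abstract, the sketch below shows a ControlNet-style pattern: a small trainable context encoder reads tokens of the masked video and adds zero-initialized background cues into the early blocks of a frozen video DiT backbone. All names (ContextEncoder, inject_and_denoise), block counts, and the additive, early-block injection scheme are illustrative assumptions, not the VideoPainter implementation.

```python
# Minimal sketch (assumed, not the authors' code) of a dual-stream setup:
# a lightweight context encoder processes the masked video and additively
# injects background cues into a frozen pre-trained video DiT.
import torch
import torch.nn as nn


class ContextEncoder(nn.Module):
    """Lightweight encoder for the masked video (stand-in for ~6% of backbone params)."""

    def __init__(self, hidden_dim: int = 512, num_blocks: int = 2):
        super().__init__()
        # Generic transformer blocks stand in for the paper's encoder blocks.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
            for _ in range(num_blocks)
        )
        # Zero-initialized projections so the injection starts as a no-op.
        self.proj = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_blocks)
        )
        for p in self.proj:
            nn.init.zeros_(p.weight)
            nn.init.zeros_(p.bias)

    def forward(self, masked_tokens: torch.Tensor) -> list[torch.Tensor]:
        cues, h = [], masked_tokens
        for blk, proj in zip(self.blocks, self.proj):
            h = blk(h)
            cues.append(proj(h))  # one background-context cue per encoder block
        return cues


def inject_and_denoise(frozen_dit_blocks: nn.ModuleList,
                       noisy_tokens: torch.Tensor,
                       cues: list[torch.Tensor]) -> torch.Tensor:
    """One backbone pass with plug-and-play additive injection into early blocks."""
    h = noisy_tokens
    for i, blk in enumerate(frozen_dit_blocks):
        h = blk(h)
        if i < len(cues):      # inject context only into the first few blocks
            h = h + cues[i]
    return h


# Toy usage on token sequences of shape (batch, tokens, hidden_dim).
dim = 512
encoder = ContextEncoder(hidden_dim=dim)
backbone = nn.ModuleList(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(4)
)
for p in backbone.parameters():
    p.requires_grad_(False)            # backbone stays frozen; only the encoder trains
masked = torch.randn(1, 64, dim)       # tokens of the masked (background) video
noisy = torch.randn(1, 64, dim)        # noisy latent tokens being denoised
out = inject_and_denoise(backbone, noisy, encoder(masked))
print(out.shape)                       # torch.Size([1, 64, 512])
```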

