STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Abstract

Image diffusion models have been adapted for real-world video super-resolution to tackle the over-smoothing of GAN-based methods. However, because these models are trained on static images, they struggle to capture temporal dynamics and to maintain temporal consistency. Integrating text-to-video (T2V) models into video super-resolution is a straightforward way to improve temporal modeling. Two key challenges remain, however: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity caused by the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution and achieves realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate that STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.
