STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Abstract

Image diffusion models have been adapted for real-world video super-resolution to address the over-smoothing seen in GAN-based methods. However, because these models are trained on static images, they capture temporal dynamics poorly and struggle to maintain temporal consistency. Integrating text-to-video (T2V) models into video super-resolution to improve temporal modeling is straightforward, but two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity caused by the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution and achieves realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate that STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.

