
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

October 6, 2024
Authors: Dohun Lee, Bryan S Kim, Geon Yeong Park, Jong Chul Ye
cs.AI

Abstract

Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues, we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: http://videoguide2025.github.io/
