FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
January 14, 2025
Authors: Yabo Zhang, Xinpeng Zhou, Yihan Zeng, Hang Xu, Hui Li, Wangmeng Zuo
cs.AI
Abstract
Interactive image editing allows users to modify images through visual
interaction operations such as drawing, clicking, and dragging. Existing
methods construct such supervision signals from videos, as they capture how
objects change with various physical interactions. However, these models are
usually built upon text-to-image diffusion models, and thus require (i) massive
training samples and (ii) an additional reference encoder to learn real-world
dynamics and visual consistency. In this paper, we reformulate this task as an
image-to-video generation problem, thereby inheriting powerful video diffusion
priors to reduce training costs and ensure temporal consistency. Specifically,
we introduce FramePainter as an efficient instantiation of this formulation.
Initialized with Stable Video Diffusion, it only uses a lightweight sparse
control encoder to inject editing signals. Considering the limitations of
temporal attention in handling large motion between two frames, we further
propose matching attention to enlarge the receptive field while encouraging
dense correspondence between edited and source image tokens. We highlight the
effectiveness and efficiency of FramePainter across various editing signals:
it dominantly outperforms previous state-of-the-art methods with far less
training data, achieving highly seamless and coherent image editing, e.g.,
automatically adjusting the reflection of a cup. Moreover, FramePainter also
exhibits exceptional generalization in scenarios not present in real-world
videos, e.g., transforming a clownfish into a shark-like shape. Our code will be
available at https://github.com/YBYBZhang/FramePainter.
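
The abstract does not spell out how matching attention is computed, so the
following is only a toy illustration of the motivation it gives. Temporal
attention in video diffusion models typically attends across frames at the same
spatial location, so a token that has moved far between two frames cannot see
its counterpart. A dense cross-frame attention, in which every edited-frame
token attends to every source-frame token, removes that restriction. All names
here (`dense_cross_frame_attention`, `best_matches`) and the toy features are
hypothetical and are not taken from the FramePainter code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_cross_frame_attention(edited_tokens, source_tokens):
    """Attention weights of every edited token over ALL source tokens.

    Illustrative assumption: plain scaled dot-product attention with no
    learned projections. Returns one row of weights per edited token.
    """
    scale = 1.0 / math.sqrt(len(source_tokens[0]))
    return [
        softmax([scale * dot(query, key) for key in source_tokens])
        for query in edited_tokens
    ]


def best_matches(weights):
    """Index of the most-attended source token for each edited token."""
    return [max(range(len(row)), key=row.__getitem__) for row in weights]


# Toy 1-D "image" with 4 tokens, each carrying a distinct feature vector.
source = [[4.0 if d == i else 0.0 for d in range(4)] for i in range(4)]

# Simulate a large motion: the edited frame is the source shifted by 2 tokens.
shift = 2
edited = [source[(i - shift) % 4] for i in range(4)]

weights = dense_cross_frame_attention(edited, source)
matches = best_matches(weights)

# What a same-location (temporal-style) comparison would see: the similarity
# between edited token i and source token i only.
same_location_similarity = [dot(edited[i], source[i]) for i in range(4)]

print("dense matches:", matches)
print("same-location similarity:", same_location_similarity)
```

In this toy setup the dense attention recovers the shifted correspondence,
while the same-location similarities are all zero, which is the kind of
failure under large motion that the abstract attributes to temporal attention.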