FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

Abstract

Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct supervision signals for this task from videos, which capture how objects change under various physical interactions. However, these models are usually built upon text-to-image diffusion models, and therefore require (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, so as to inherit powerful video diffusion priors that reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it uses only a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various editing signals: it clearly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, e.g., automatically adjusting the reflection of a cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, e.g., transforming a clownfish into a shark-like shape. Our code will be available at https://github.com/YBYBZhang/FramePainter.
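The contrast between temporal attention and a wider, cross-frame receptive field can be illustrated with a toy example. The sketch below is not the authors' implementation: the abstract does not specify projections, heads, or how matching attention is trained, so every name here (`temporal_attention`, `cross_frame_attention`, the two-dimensional token features) is a hypothetical stand-in. It assumes only that temporal attention lets a token look at the same spatial position across the two frames, while the cross-frame variant lets each edited-image token look at all source-image tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


def temporal_attention(source_tokens, edited_tokens):
    """Each edited token sees only the tokens at its own spatial index."""
    outputs = []
    for i, query in enumerate(edited_tokens):
        same_position = [source_tokens[i], edited_tokens[i]]
        output, _ = attend(query, same_position, same_position)
        outputs.append(output)
    return outputs


def cross_frame_attention(source_tokens, edited_tokens):
    """Each edited token sees every source token (assumed reading of a
    dense edited-to-source correspondence; hypothetical, not the paper's code)."""
    outputs, all_weights = [], []
    for query in edited_tokens:
        output, weights = attend(query, source_tokens, source_tokens)
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy frames with four spatial positions. The object sits at index 0 in the
# source image and has been dragged to index 3 in the edited image.
OBJ, BG = [6.0, 0.0], [0.0, 6.0]
source = [OBJ, BG, BG, BG]
edited = [BG, BG, BG, OBJ]

temporal_out = temporal_attention(source, edited)
cross_out, weights = cross_frame_attention(source, edited)

# Source index that the moved object (edited index 3) attends to most strongly.
best_match = max(range(len(source)), key=lambda j: weights[3][j])
print("edited token 3 matches source token", best_match)
```

In this toy setting the temporal variant never compares the moved object with its original location, because index 3 of the source frame holds background. The cross-frame variant can recover the correspondence, which is the intuition behind enlarging the receptive field for large motion.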
