OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models

Abstract

Omnimatte aims to decompose a given video into semantically meaningful layers: the background and the individual objects, each together with its associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. We accomplish this by adapting zero-shot image inpainting techniques to video object removal, a task they fail to handle effectively out of the box. We then show that self-attention maps capture information about the object and its footprints, and we use them to inpaint the object's effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior background reconstruction but also sets a new record as the fastest Omnimatte approach, achieving real-time performance with minimal per-frame runtime.
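
As a rough illustration of what "simple latent arithmetic" could look like, the sketch below treats an object layer as the difference between the latent of the original video and the latent of the object-removed background, and composites it by adding that difference to a new background latent. The abstract does not give the exact formulation, so this reading is an assumption; the helper names, the tensor shape, and the toy data are all hypothetical, and NumPy arrays stand in for the latents of a real video diffusion model.

```python
import numpy as np

def extract_object_layer(z_video, z_background):
    """Hypothetical helper: object layer as a latent difference (assumed form)."""
    return z_video - z_background

def composite(z_new_background, z_object_layer):
    """Hypothetical helper: add the object layer onto a new background latent."""
    return z_new_background + z_object_layer

# Toy latents with an illustrative (frames, channels, height, width) shape.
rng = np.random.default_rng(0)
shape = (4, 8, 16, 16)
z_background = rng.normal(size=shape)

# Toy object contribution confined to a small spatial patch.
z_object = np.zeros(shape)
z_object[:, :, 4:8, 4:8] = 1.0

# Assumption for this toy case only: the video latent is exactly additive.
z_video = z_background + z_object
z_new_background = rng.normal(size=shape)

layer = extract_object_layer(z_video, z_background)
z_composite = composite(z_new_background, layer)
```

In this toy setup the recovered layer matches the object contribution by construction. With real encoder latents the decomposition is only approximate, so a practical pipeline would likely need some refinement after compositing.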
