SMITE: Segment Me In TimE

Abstract

Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation is performed at arbitrary granularity, meaning the number of segments can vary arbitrarily, and masks are defined based on only one or a few sample images. In this paper, we address this problem by employing a pre-trained text-to-image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively handle various segmentation scenarios and outperforms state-of-the-art alternatives.

