MIVE: New Design and Benchmark for Multi-Instance Video Editing

Abstract

Recent AI-based video editing lets users edit videos through simple text prompts, significantly simplifying the editing process. However, recent zero-shot video editing techniques focus primarily on global or single-object edits, which can cause unintended changes in other parts of the video. When multiple objects require localized edits, existing methods face several challenges: unfaithful editing, editing leakage, and a lack of suitable evaluation datasets and metrics. To overcome these limitations, we propose MIVE, a zero-shot Multi-Instance Video Editing framework. MIVE is a general-purpose mask-based framework that is not dedicated to specific objects (e.g., people). It introduces two key modules: (i) Disentangled Multi-instance Sampling (DMS) to prevent editing leakage and (ii) Instance-centric Probability Redistribution (IPR) to ensure precise localization and faithful editing. Additionally, we present the new MIVE Dataset, which features diverse video scenarios, and introduce the Cross-Instance Accuracy (CIA) Score to evaluate editing leakage in multi-instance video editing tasks. Our extensive qualitative, quantitative, and user study evaluations demonstrate that MIVE significantly outperforms recent state-of-the-art methods in terms of editing faithfulness, accuracy, and leakage prevention, setting a new benchmark for multi-instance video editing. The project page is available at https://kaist-viclab.github.io/mive-site/
