OmniCreator: Self-Supervised Unified Generation with Universal Editing
December 3, 2024
Authors: Haodong Chen, Lan Wang, Harry Yang, Ser-Nam Lim
cs.AI
Abstract
We introduce OmniCreator, a novel framework that can conduct text-prompted
unified (image+video) generation as well as editing all in one place.
OmniCreator acquires generative and universal editing capabilities in a
self-supervised manner, taking original text-video pairs as conditions while
utilizing the same video as a denoising target to learn the semantic
correspondence between video and text. During inference, when presented with a
text prompt and a video, OmniCreator is capable of generating a target that is
faithful to both, achieving a universal editing effect that is unconstrained as
opposed to existing editing work that primarily focuses on certain editing
types or relies on additional controls (e.g., structural conditions, attention
features, or DDIM inversion). On the other hand, when presented with a text
prompt only, OmniCreator becomes generative, producing high-quality video as a
result of the semantic correspondence learned. Importantly, we find that the
same capabilities extend to images as well, making OmniCreator a truly unified
framework. Further, due to the lack of existing generative video editing
benchmarks, we introduce the OmniBench-99 dataset, designed to evaluate the
performance of generative video editing models comprehensively. Extensive
experiments demonstrate that OmniCreator substantially outperforms all other
models.
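
To make the training scheme concrete, here is a minimal PyTorch-style sketch of what the abstract describes: the original text-video pair conditions the model, the same video serves as the denoising target, and at inference the video condition is either supplied (editing) or omitted (generation). This is an illustration under assumed conventions (epsilon-prediction, a linear noise schedule, placeholder denoiser, text_encoder, and video_encoder callables), not the authors' implementation.

```
import torch
import torch.nn.functional as F

T = 1000                                   # assumed number of diffusion steps
betas = torch.linspace(1e-4, 2e-2, T)      # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, 0)

def add_noise(x0, noise, t):
    """Forward diffusion q(x_t | x_0) for video tensors of shape (B, C, F, H, W)."""
    a = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

def training_step(denoiser, text_encoder, video_encoder, video, caption, opt):
    """Self-supervised step: condition on the original (text, video) pair,
    then denoise that SAME video, learning text-video semantic correspondence."""
    text_emb = text_encoder(caption)       # semantic condition from the caption
    vid_emb = video_encoder(video)         # semantic condition from the same video
    noise = torch.randn_like(video)
    t = torch.randint(0, T, (video.shape[0],), device=video.device)
    pred = denoiser(add_noise(video, noise, t), t, text_emb, vid_emb)
    loss = F.mse_loss(pred, noise)         # epsilon-prediction objective
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def sample(denoiser, text_encoder, video_encoder, caption,
           video=None, shape=(1, 4, 16, 32, 32)):
    """Inference dispatch implied by the abstract: text + video -> universal
    editing; text only -> generation. Deterministic DDIM-style loop (eta = 0)."""
    text_emb = text_encoder(caption)
    vid_emb = video_encoder(video) if video is not None else None
    x = torch.randn(shape)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps = denoiser(x, t, text_emb, vid_emb)
        a = alphas_cumprod[i]
        a_prev = alphas_cumprod[i - 1] if i > 0 else torch.tensor(1.0)
        x0_hat = (x - (1 - a).sqrt() * eps) / a.sqrt()        # predicted clean latent
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
    return x
```

Because the conditioning video at training time is the denoising target itself, the model has no incentive to copy structure blindly; it must learn which semantics the text refers to, which is what lets a new prompt at inference steer edits without structural conditions, attention-feature injection, or DDIM inversion.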