OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

November 11, 2024
Authors: Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen
cs.AI

Abstract

Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. First, existing models have limited editing skills because the synthesis process is biased. Second, these methods are trained on datasets with a high volume of noise and artifacts, a consequence of simple filtering methods such as CLIP-score. Third, all these datasets are restricted to a single low resolution and a fixed aspect ratio, which limits their versatility in real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that seamlessly handles seven different image editing tasks at any aspect ratio. Our contributions are fourfold: (1) OmniEdit is trained with supervision from seven different specialist models to ensure task coverage; (2) we use importance sampling based on scores provided by large multimodal models (such as GPT-4o), instead of CLIP-score, to improve data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions that cover different tasks. Both automatic and human evaluations demonstrate that OmniEdit can significantly outperform all existing models. Our code, dataset, and model will be available at https://tiger-ai-lab.github.io/OmniEdit/
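
The abstract does not spell out how the score-based sampling works, so the following is only a minimal sketch of one way to favor high-scoring editing pairs when building a training set. Everything in it is an assumption made for illustration: the function name `sample_by_score`, the use of scores directly as sampling weights, and the weighted-sampling-without-replacement scheme (random keys `u ** (1 / weight)`) are not taken from the paper, and the scores would in practice come from a large multimodal model rather than being hard-coded.

```python
import random


def sample_by_score(pairs, scores, k, seed=0):
    """Draw k distinct pairs, favoring pairs with higher quality scores.

    Hypothetical helper, not the authors' procedure. Each pair with a
    positive score gets a random key u ** (1 / score); keeping the k
    largest keys yields a weighted sample without replacement.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")

    rng = random.Random(seed)
    keyed = []
    for pair, score in zip(pairs, scores):
        if score <= 0:
            continue  # zero or negative scores are never selected
        u = rng.random()  # uniform in [0.0, 1.0)
        keyed.append((u ** (1.0 / score), pair))

    if k > len(keyed):
        raise ValueError("not enough pairs with a positive score")

    # Larger keys win; higher scores push keys toward 1.0.
    keyed.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in keyed[:k]]


if __name__ == "__main__":
    # Placeholder identifiers and scores, purely illustrative.
    pairs = ["pair_a", "pair_b", "pair_c", "pair_d"]
    scores = [2.0, 9.5, 8.0, 0.0]
    print(sample_by_score(pairs, scores, k=2))
```

Unlike a hard threshold, this kind of weighted draw still admits some mid-scoring pairs, which keeps the selected data more diverse; whether that trade-off matches the paper's pipeline is not stated in the abstract.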
