Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

Abstract

Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely because large, high-quality training datasets are difficult to create. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input and output of an instruction-guided image-editing model. However, because of the limitations of T2I models, these image pairs often fail to align with the specified edit instructions, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images in order to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset, obtaining over 120K refined samples, which we then use to fine-tune the InstructPix2Pix model, guided by our novel Instruct-CLIP-based loss function. The resulting model produces edits that are more aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.
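
The abstract does not spell out the training objective, so the following is only a generic sketch of the kind of CLIP-style symmetric contrastive loss that the phrase "contrastive learning" usually refers to, not the authors' implementation. It assumes each training sample provides one embedding of the change between the original and edited image and one embedding of the edit instruction, and that matching pairs share the same index within a batch. The function names, the plain-list representation, and the temperature of 0.07 are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(change_embs, instr_embs, temperature=0.07):
    """Hypothetical CLIP-style loss over a batch of paired embeddings.

    change_embs[i] stands for the embedding of the change between the
    original and edited image of sample i; instr_embs[i] stands for the
    embedding of the matching edit instruction. Both are assumptions
    made for illustration only.
    """
    logits = [
        [cosine(c, t) / temperature for t in instr_embs]
        for c in change_embs
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the change-to-instruction and instruction-to-change directions.
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed)
    )
```

Under these assumptions, a batch whose image changes and instructions agree yields a loss near zero, while a batch whose instructions are swapped yields a large loss. A loss of this type therefore rewards instruction text that matches the actual edit, which is the alignment property the abstract describes.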
