Diffusion Self-Distillation for Zero-Shot Customized Image Generation

November 27, 2024
Authors: Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, Gordon Wetzstein
cs.AI

Abstract

Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.
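To make the two-stage pipeline in the abstract concrete, below is a minimal Python sketch of the data flow. All interfaces (generate_grid, is_identity_consistent, caption, finetune) are hypothetical placeholders standing in for the paper's actual components; this illustrates the idea under those assumptions, not the authors' implementation.

```python
# Sketch of Diffusion Self-Distillation's two stages, as described in the
# abstract. Every method call below is a hypothetical placeholder, not a
# real library API.

def build_paired_dataset(t2i_model, vlm, prompts, grid_size=4):
    """Stage 1: generate image grids and curate them into paired data."""
    pairs = []
    for prompt in prompts:
        # Exploit the text-to-image model's in-context generation ability:
        # one prompt yields a grid of images depicting the same subject
        # in varied contexts.
        grid = t2i_model.generate_grid(prompt, size=grid_size)  # hypothetical
        # Keep only grids in which the Vision-Language Model judges the
        # subject's identity to be consistent across panels.
        if not vlm.is_identity_consistent(grid):  # hypothetical
            continue
        reference, targets = grid[0], grid[1:]
        # Each (reference image, target caption, target image) triple becomes
        # one training example for the text+image-to-image model.
        pairs.extend((reference, vlm.caption(t), t) for t in targets)  # hypothetical
    return pairs


def diffusion_self_distillation(t2i_model, vlm, prompts):
    """Stage 2: fine-tune the same model, now conditioned on a reference
    image in addition to text, on its own curated outputs."""
    paired_data = build_paired_dataset(t2i_model, vlm, prompts)
    return finetune(t2i_model, paired_data, condition_on_image=True)  # hypothetical
```

The key design point the sketch captures is that no external paired data is required: the pre-trained model produces its own supervision, and the VLM acts as an automatic curator, which is why the resulting model needs no test-time optimization.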
