Diffusion Self-Distillation for Zero-Shot Customized Image Generation

Abstract

Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.
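
The data-curation step described above (generate a grid of images, then keep only the pairs a Vision-Language Model accepts) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `split_grid` and `curate_pairs`, the nested-list image representation, and the `same_identity` callable that stands in for the Vision-Language Model are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# Toy image type (assumption): a list of rows, each row a list of pixel values.
Image = List[List[int]]


def split_grid(grid: Image, rows: int, cols: int) -> List[Image]:
    """Cut a grid image into rows * cols equally sized panels, in row-major order."""
    height, width = len(grid), len(grid[0])
    if height % rows or width % cols:
        raise ValueError("grid size must be divisible by the panel layout")
    panel_h, panel_w = height // rows, width // cols
    panels = []
    for r in range(rows):
        for c in range(cols):
            band = grid[r * panel_h:(r + 1) * panel_h]
            panels.append([row[c * panel_w:(c + 1) * panel_w] for row in band])
    return panels


def curate_pairs(
    panels: List[Image],
    same_identity: Callable[[Image, Image], bool],
) -> List[Tuple[Image, Image]]:
    """Keep every unordered panel pair that the judge accepts.

    `same_identity` is a hypothetical stand-in for the Vision-Language Model
    that decides whether two panels show the same instance.
    """
    return [(a, b) for a, b in combinations(panels, 2) if same_identity(a, b)]
```

In a real pipeline the judge would query a Vision-Language Model and the accepted pairs would become (reference image, target image) training examples for the text+image-to-image fine-tuning stage; here any two-argument predicate can be passed in to try the flow on toy data.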
