FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

Abstract

Recent advances in text-to-image generation have enabled the creation of high-quality images for diverse applications. However, accurately describing the desired visual attributes can be difficult, especially for non-experts in art and photography. An intuitive solution is to adopt the preferred attributes directly from source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that covers texture, color, and artistic elements but leaves out other important attributes such as lighting and dynamics. In addition, a simplified "style" adaptation prevents users from combining multiple attributes from different sources into a single generated image. In this work, we formulate a more effective approach that decomposes the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images. To achieve this goal, we construct what is, to the best of our knowledge, the first fine-grained visual attributes dataset (FiVA). The FiVA dataset features a well-organized taxonomy of visual attributes and includes around 1M high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples visual attributes from one or more source images and adapts them into a generated image. This approach makes customization more user-friendly, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.
